MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions at EXPLAIN EXTENDED

MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Magerio asks:

Hello, thanks for your great post about advanced row sampling, it helped me a lot.

I have a question.

I am using your SQL query to show a portfolio of products and these products have a certain integer quality level (the lower, the better).

What if I want to show the best 3 products of each category?

Just a quick reminder: the article mentioned describes a query to emulate ROW_NUMBER() in MySQL. Since MySQL does not support analytic (window) functions natively, we had to do some kind of a trick to mimic their behaviour.

The original article described a query to obtain 15 first id's for each grouper. The idea behind this query was quite simple:

Take a list of groupers using DISTINCT
For each grouper, find the 15th id using LIMIT in a correlated subquery. We have now a set of (grouper, id15) pairs: for any given grouper we have the id of the 15th row (ordered by id).
Join this resultset with the main table so that for each grouper, the id's less or equal to the 15th id are selected. We then will have 15 lowest id's for each grouper that is exactly what we were after.

This emulated ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id): a function with a single PARTITION BY expression and a single ORDER BY expression.

But in your case there are multiple ORDER BY expressions. It may be not so obvious (since the question mentions only one ordering criterion, namely, quality). However, the solution above relied on the fact that the field in ORDER BY was unique, so we now that there are exactly 15 rows with id less or equal to the id returned with LIMIT 15. Since this is most probably not the case with quality, we should include id into the ORDER BY clause so that it makes the ordering conidition unique.

A subquery in MySQL, however, can only return a single value.

To work around this, we should use two subqueries that will return us a single column in an ORDER BY condition.

Let's create a sample table:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_portfolio (
id INT NOT NULL PRIMARY KEY,
category INT NOT NULL,
quality INT NOT NULL,
value VARCHAR(100) NOT NULL,
KEY ix_portfolio_category_quality (category, quality)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_portfolio
SELECT  id, id % 5 + 1, FLOOR(RAND(20090914) * 10000) + 1, CONCAT('Portfolio ', id)
FROM    filler;

There are 100,000 rows with 5 categories and random qualities from 1 to 10,000 (lower is better).

There is a composite index on (category, quality). Note that id is not a part of the index: since out table is InnoDB and id is a PRIMARY KEY, it is implicitly included as a trailing column of any index. There is no need to inlcude it explicitly.

We should select 3 best rows for each category.

To do this, we should build a list of distinct categories and select the quality and id of the third row for each category:

View the query

SELECT  category,
COALESCE
(
(
SELECT  quality
FROM    t_portfolio pi
WHERE   pi.category = pd.category
ORDER BY
category, quality, id
LIMIT 2, 1
),
CAST(0xFFFFFFFF AS DECIMAL)
) AS mquality,
COALESCE
(
(
SELECT  id
FROM    t_portfolio pi
WHERE   pi.category = pd.category
ORDER BY
category, quality, id
LIMIT 2, 1
),
CAST(0xFFFFFFFF AS DECIMAL)
) AS mid
FROM    (
SELECT  DISTINCT category
FROM    t_portfolio
) pd

category	mquality	mid
1	2	6480
2	1	83081
3	2	35832
4	3	35578
5	7	49734
5 rows fetched in 0.0002s (0.0021s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived4>	ALL					5	100.00
4	DERIVED	t_portfolio	range		ix_portfolio_category_quality	4		5	100.00	Using index for group-by
3	DEPENDENT SUBQUERY	pi	ref	ix_portfolio_category_quality	ix_portfolio_category_quality	4	pd.category	25024	100.00	Using where; Using index
2	DEPENDENT SUBQUERY	pi	ref	ix_portfolio_category_quality	ix_portfolio_category_quality	4	pd.category	25024	100.00	Using where; Using index

As we can see, this query is instant: MySQL is very good in optimizing DISTINCT on an indexed field as well as in optimizing ORDER BY with a LIMIT on an indexed condition.

Now we need to join this resultset with the original table so that all (quality, id) pairs below the upper threshold are returned.

MySQL is not particularly good in optimizing the range conditions in a join. That's why in the original article (Advanced row sampling) we used a little trick to force MySQL use Range checked for each record.

To do that, we had to rewrite the JOIN condition on category as a pair of two weak inequalities:

p.category <= po.category AND p.category >= po.category

It's easy to see that it's logically equivalent to an equality condition:

p.category = po.category

, however the former syntax forces MySQL to do the range optimization on each nested loop.

We could probably use this trick in the new query too and join the table on the following condition:

p.category <= po.category AND p.category >= po.category AND (p.quality, p.id) <= (po.mquality, po.mid)

View the query details

SELECT  p.*
FROM    (
SELECT  category,
COALESCE
(
(
SELECT  quality
FROM    t_portfolio pi
WHERE   pi.category = pd.category
ORDER BY
category, quality, id
LIMIT 2, 1
),
CAST(0xFFFFFFFF AS DECIMAL)
) AS mquality,
COALESCE
(
(
SELECT  id
FROM    t_portfolio pi
WHERE   pi.category = pd.category
ORDER BY
category, quality, id
LIMIT 2, 1
),
CAST(0xFFFFFFFF AS DECIMAL)
) AS mid
FROM    (
SELECT  DISTINCT category
FROM    t_portfolio
) pd
) po
JOIN    t_portfolio p
ON      p.category &lt;= po.category
        AND p.category &gt;= po.category
AND (p.quality, p.id) &lt;= (po.mquality, po.mid)

id	category	quality	value
6480	1	2	Portfolio 6480
31870	1	1	Portfolio 31870
55455	1	1	Portfolio 55455
1351	2	1	Portfolio 1351
65171	2	1	Portfolio 65171
83081	2	1	Portfolio 83081
12647	3	1	Portfolio 12647
35832	3	2	Portfolio 35832
56042	3	1	Portfolio 56042
973	4	2	Portfolio 973
6608	4	2	Portfolio 6608
35578	4	3	Portfolio 35578
49734	5	7	Portfolio 49734
65394	5	4	Portfolio 65394
96459	5	5	Portfolio 96459
15 rows fetched in 0.0006s (0.5468s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					5	100.00
1	PRIMARY	p	ALL	ix_portfolio_category_quality				100096	100.00	Range checked for each record (index map: 0x2)
2	DERIVED	<derived5>	ALL					5	100.00
5	DERIVED	t_portfolio	range		ix_portfolio_category_quality	4		5	100.00	Using index for group-by
4	DEPENDENT SUBQUERY	pi	ref	ix_portfolio_category_quality	ix_portfolio_category_quality	4	pd.category	25024	100.00	Using where; Using index
3	DEPENDENT SUBQUERY	pi	ref	ix_portfolio_category_quality	ix_portfolio_category_quality	4	pd.category	25024	100.00	Using where; Using index

However, in this case the query is by far not as efficient as the original one (it takes more than 0.5 seconds).

MySQL for some reason cannot efficiently optimize such a range. To help it a little we should include an additional condition into the ON clause of the JOIN:

p.category <= po.category AND p.category >= po.category AND p.quality <= po.mquality AND (p.quality, p.id) <= (po.mquality, po.mid)

The third condition (p.quality <= po.quality) is logically redundant, since (p.quality, p.id) <= (po.mquality, po.mid) implies it. But it will help MySQL to build more efficient range condition on the index.

The final query looks like this:

SELECT  p.*
FROM    (
SELECT  category,
COALESCE
(
(
SELECT  quality
FROM    t_portfolio pi
WHERE   pi.category = pd.category
ORDER BY
category, quality, id
LIMIT 2, 1
),
CAST(0xFFFFFFFF AS DECIMAL)
) AS mquality,
COALESCE
(
(
SELECT  id
FROM    t_portfolio pi
WHERE   pi.category = pd.category
ORDER BY
category, quality, id
LIMIT 2, 1
),
CAST(0xFFFFFFFF AS DECIMAL)
) AS mid
FROM    (
SELECT  DISTINCT category
FROM    t_portfolio
) pd
) po
JOIN    t_portfolio p
ON      p.category &lt;= po.category
        AND p.category &gt;= po.category
AND p.quality &lt;= po.mquality
        AND (p.quality, p.id) &lt;= (po.mquality, po.mid)

View query results and execution plan

id	category	quality	value
31870	1	1	Portfolio 31870
55455	1	1	Portfolio 55455
6480	1	2	Portfolio 6480
1351	2	1	Portfolio 1351
65171	2	1	Portfolio 65171
83081	2	1	Portfolio 83081
12647	3	1	Portfolio 12647
56042	3	1	Portfolio 56042
35832	3	2	Portfolio 35832
973	4	2	Portfolio 973
6608	4	2	Portfolio 6608
35578	4	3	Portfolio 35578
65394	5	4	Portfolio 65394
96459	5	5	Portfolio 96459
49734	5	7	Portfolio 49734
15 rows fetched in 0.0006s (0.0037s)

We see that including a logically redundant condition into the ON clause of the JOIN made MySQL build a more efficient range for the index and reduced query time from 548 ms to 4 ms.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

September 14th, 2009 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED