MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Magerio asks:
Hello, thanks for your great post about advanced row sampling, it helped me a lot.
I have a question.
I am using your SQL query to show a portfolio of products and these products have a certain integer quality level (the lower, the better).
What if I want to show the best 3 products of each category?
Just a quick reminder: the article mentioned describes a query to emulate ROW_NUMBER()
in MySQL. Since MySQL does not support analytic (window) functions natively, we had to do some kind of a trick to mimic their behaviour.
The original article described a query to obtain 15 first id
's for each grouper
. The idea behind this query was quite simple:
- Take a list of groupers using
DISTINCT
- For each grouper, find the 15th
id
usingLIMIT
in a correlated subquery. We have now a set of(grouper, id15)
pairs: for any givengrouper
we have theid
of the 15th row (ordered byid
). - Join this resultset with the main table so that for each
grouper
, theid
's less or equal to the 15thid
are selected. We then will have 15 lowestid
's for eachgrouper
that is exactly what we were after.
This emulated ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)
: a function with a single PARTITION BY
expression and a single ORDER BY
expression.
But in your case there are multiple ORDER BY
expressions. It may be not so obvious (since the question mentions only one ordering criterion, namely, quality
). However, the solution above relied on the fact that the field in ORDER BY
was unique, so we now that there are exactly 15 rows with id
less or equal to the id
returned with LIMIT 15
. Since this is most probably not the case with quality, we should include id
into the ORDER BY
clause so that it makes the ordering conidition unique.
A subquery in MySQL, however, can only return a single value.
To work around this, we should use two subqueries that will return us a single column in an ORDER BY
condition.
Let's create a sample table:
Table creation details
There are 100,000 rows with 5 categories and random qualities from 1 to 10,000 (lower is better).
There is a composite index on (category, quality)
. Note that id
is not a part of the index: since out table is InnoDB
and id
is a PRIMARY KEY
, it is implicitly included as a trailing column of any index. There is no need to inlcude it explicitly.
We should select 3 best rows for each category.
To do this, we should build a list of distinct categories and select the quality
and id
of the third row for each category:
As we can see, this query is instant: MySQL is very good in optimizing DISTINCT
on an indexed field as well as in optimizing ORDER BY
with a LIMIT
on an indexed condition.
Now we need to join this resultset with the original table so that all (quality, id)
pairs below the upper threshold are returned.
MySQL is not particularly good in optimizing the range conditions in a join. That's why in the original article (Advanced row sampling) we used a little trick to force MySQL use Range checked for each record
.
To do that, we had to rewrite the JOIN
condition on category
as a pair of two weak inequalities:
p.category <= po.category AND p.category >= po.category
It's easy to see that it's logically equivalent to an equality condition:
p.category = po.category
, however the former syntax forces MySQL to do the range optimization on each nested loop.
We could probably use this trick in the new query too and join the table on the following condition:
p.category <= po.category
AND p.category >= po.category
AND (p.quality, p.id) <= (po.mquality, po.mid)
However, in this case the query is by far not as efficient as the original one (it takes more than 0.5 seconds).
MySQL for some reason cannot efficiently optimize such a range. To help it a little we should include an additional condition into the ON
clause of the JOIN
:
p.category <= po.category
AND p.category >= po.category
AND p.quality <= po.mquality
AND (p.quality, p.id) <= (po.mquality, po.mid)
The third condition (p.quality <= po.quality
) is logically redundant, since (p.quality, p.id) <= (po.mquality, po.mid)
implies it. But it will help MySQL to build more efficient range condition on the index.
The final query looks like this:
SELECT p.* FROM ( SELECT category, COALESCE ( ( SELECT quality FROM t_portfolio pi WHERE pi.category = pd.category ORDER BY category, quality, id LIMIT 2, 1 ), CAST(0xFFFFFFFF AS DECIMAL) ) AS mquality, COALESCE ( ( SELECT id FROM t_portfolio pi WHERE pi.category = pd.category ORDER BY category, quality, id LIMIT 2, 1 ), CAST(0xFFFFFFFF AS DECIMAL) ) AS mid FROM ( SELECT DISTINCT category FROM t_portfolio ) pd ) po JOIN t_portfolio p ON p.category <= po.category AND p.category >= po.category AND p.quality <= po.mquality AND (p.quality, p.id) <= (po.mquality, po.mid)
View query results and execution plan
We see that including a logically redundant condition into the ON
clause of the JOIN
made MySQL build a more efficient range for the index and reduced query time from 548 ms to 4 ms.
Hope that helps.
I'm always glad to answer the questions regarding database queries.