Advanced row sampling
Yesterday I wrote an article on how to emulate analytiс function ROW_NUMBER()
that is present in SQL Server and Oracle, but absent in MySQL.
Today, we will try to optimize this query.
The original query scans all table, filtering out rows when necessary. This, as has already been said, if efficient if the number of rows is comparable to the number of grouper
's.
But in real tables, the number of grouper
's tends to be much less than the number of rows, leaving us some place for optimization.
The main drawback of the query above is that it always scans the whole table, while in fact it needs just a few items for each grouper
.
What the query really needs to do, is:
- Select all
grouper
's - For each
grouper
, select 15 first rows, that is rows with minimalid
's - Return the rows to the client
This could be easily achieved in any procedural language using simple conditional loops. But in MySQL, we cannot just loop a table in an SQL
query with a counter. We need to supply the join conditions first. And, unfortunately for us, the NESTED LOOP
count cannot be used as a join condition. We don't even have access to this variable from within the query.
We, though, may try to outsmart the database and fetch the join conditions we need.
What we really need here, is the 1st and the 15th id
's for each grouper
.
As we try to select rows sorted by id
for each grouper
here, we really need all rows that satisfy this condition: (grouper, id[grouper][1]) <= (row.grouper, row.id) <= (row.grouper, id[grouper][15])
15th id
canbe easily selected as:
SELECT id FROM t_limiter li WHERE li.grouper = @grouper ORDER BY li.grouper, li.id LIMIT 14, 1
This will work instantly, as it will utilize the index on (grouper, id)
which we created along with the table.
Now, what if we don't have the 15's value in the table? Like in sample table, for which I deliberatly created only 1 record for grouper
32?
In this case the query will return no rows. Empty recordset, beign returned as result a correlated subquery, is treated like a NULL
, which we can easily coalesce to MAXVALUE
(that is, 2^32 - 1
).
Summarizing all said above, we will write the following query:
SELECT l.* FROM ( SELECT grouper, COALESCE( ( SELECT id FROM t_limiter li WHERE li.grouper = dlo.grouper ORDER BY li.grouper, li.id LIMIT 14, 1 ), CAST(0xFFFFFFFF AS DECIMAL)) AS mid FROM ( SELECT DISTINCT grouper FROM t_limiter dl ) dlo ) lo, t_limiter l WHERE l.grouper = lo.grouper AND l.id <= lo.mid
id | grouper | value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | 0 | Value 32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64 | 0 | Value 64 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
447 | 31 | Value 447 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
479 | 31 | Value 479 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1000000 | 32 | Value 1000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
481 rows fetched in 0,0047s (3,2374s) |
This query works for… hmm, 3,2374 seconds, which is hardly an improvement compared to 3,4241 seconds from our previous query.
Let's look into the plan:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 33 | 100.00 | (null) | ||||
1 | PRIMARY | l | ref | PRIMARY,ix_limiter_grouper_id | ix_limiter_grouper_id | 4 | lo.grouper | 4495 | 100.00 | Using where |
2 | DERIVED | <derived4> | ALL | 33 | 100.00 | (null) | ||||
4 | DERIVED | dl | range | ix_limiter_grouper_id | 4 | 225 | 100.00 | Using index for group-by | ||
3 | DEPENDENT SUBQUERY | li | ref | ix_limiter_grouper_id | ix_limiter_grouper_id | 4 | dlo.grouper | 4495 | 100.00 | Using where; Using index |
As you can see, the plan for some reason doesn't use range
access method for the very outer instance on t_limiter
(aliased as l
in our query). Instead, it tries to scan the portions of index with a given grouper
and filter out the matching id
's using where.
This kills all performance benefit from using indexes, as it's exactly same thing we did in our previous, inefficient query.
Can we somehow convince MySQL to use the range access method instead?
Fortunately for us, yes.
As MySQL documentation says here:
Range checked for each record (index map: N)
MySQL found no good index to use, but found that some of indexes might be used after column values from preceding tables are known. For each row combination in the preceding tables, MySQL checks whether it is possible to use arange
orindex_merge
access method to retrieve rows. This is not very fast, but is faster than performing a join with no index at all.
All we need to outsmart MySQL here is to add some uncertainty into the join conditions, so that the index
will not be immediately eligible for ref
access:
SELECT l.* FROM ( SELECT grouper, COALESCE( ( SELECT id FROM t_limiter li WHERE li.grouper = dlo.grouper ORDER BY li.grouper, li.id LIMIT 14, 1 ), CAST(0xFFFFFFFF AS DECIMAL)) AS mid FROM ( SELECT DISTINCT grouper FROM t_limiter dl ) dlo ) lo, t_limiter l WHERE l.grouper >= lo.grouper AND l.grouper <= lo.grouper AND l.id <= lo.mid
Note the new join conditions: l.grouper >= lo.grouper AND l.grouper <= lo.grouper
. This condition of course is equivalent to l.grouper = lo.grouper
, but MySQL now delays selection of access method till the query is run:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 33 | 100.00 | (null) | ||||
1 | PRIMARY | l | ALL | PRIMARY,ix_limiter_grouper_id | 997544 | 100.00 | Range checked for each record (index map: 0x3) | |||
2 | DERIVED | <derived4> | ALL | 33 | 100.00 | (null) | ||||
4 | DERIVED | dl | range | ix_limiter_grouper_id | 4 | 19 | 100.00 | Using index for group-by | ||
3 | DEPENDENT SUBQUERY | li | ref | ix_limiter_grouper_id | ix_limiter_grouper_id | 4 | dlo.grouper | 55419 | 100.00 | Using where; Using index |
Index map: 0x3
in the plan above is the bitmask of indices the optimizer will choose from when it comes to selecting an access method. In our case, it will choose from both indexes present in our table: PRIMARY ON id
and ix_limiter_grouper_id ON (grouper, id)
Let's run the query and hope that MySQL will make the right decision:
id | grouper | value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | 0 | Value 32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64 | 0 | Value 64 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
447 | 31 | Value 447 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
479 | 31 | Value 479 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1000000 | 32 | Value 1000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
481 rows fetched in 0,0033s (0,0072s) |
0,0072 seconds!
Wow!
hi
first thanks !!
it is actually pretty amazing.
now i wa wonder if you can help with a query than contains reference to other columns beside grouper and id
my columns are (grouper :car_id, id: plan_datetime_start, and the others are :: order_status, cancel_reason)
how can i get latest 15 orders ignore status=6 and cancel is not nulll
yoni
8 Aug 22 at 14:57