How to select N rows for a GROUP in MySQL at EXPLAIN EXTENDED

Advanced row sampling

Yesterday I wrote an article on how to emulate analytiс function ROW_NUMBER() that is present in SQL Server and Oracle, but absent in MySQL.

Today, we will try to optimize this query.

The original query scans all table, filtering out rows when necessary. This, as has already been said, if efficient if the number of rows is comparable to the number of grouper's.

But in real tables, the number of grouper's tends to be much less than the number of rows, leaving us some place for optimization.

The main drawback of the query above is that it always scans the whole table, while in fact it needs just a few items for each grouper.

What the query really needs to do, is:

Select all grouper's
For each grouper, select 15 first rows, that is rows with minimal id's
Return the rows to the client

This could be easily achieved in any procedural language using simple conditional loops. But in MySQL, we cannot just loop a table in an SQL query with a counter. We need to supply the join conditions first. And, unfortunately for us, the NESTED LOOP count cannot be used as a join condition. We don't even have access to this variable from within the query.

We, though, may try to outsmart the database and fetch the join conditions we need.

What we really need here, is the 1st and the 15th id's for each grouper.

As we try to select rows sorted by id for each grouper here, we really need all rows that satisfy this condition: (grouper, id[grouper][1]) <= (row.grouper, row.id) <= (row.grouper, id[grouper][15])

15th id canbe easily selected as:

SELECT  id
FROM    t_limiter li
WHERE   li.grouper = @grouper
ORDER BY
li.grouper, li.id
LIMIT 14, 1

This will work instantly, as it will utilize the index on (grouper, id) which we created along with the table.

Now, what if we don't have the 15's value in the table? Like in sample table, for which I deliberatly created only 1 record for grouper 32?

In this case the query will return no rows. Empty recordset, beign returned as result a correlated subquery, is treated like a NULL, which we can easily coalesce to MAXVALUE (that is, 2^32 - 1).

Summarizing all said above, we will write the following query:

SELECT  l.*
FROM    (
SELECT  grouper,
COALESCE(
(
SELECT  id
FROM    t_limiter li
WHERE   li.grouper = dlo.grouper
ORDER BY
li.grouper, li.id
LIMIT 14, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS mid
FROM    (
SELECT  DISTINCT grouper
FROM    t_limiter dl
) dlo
) lo, t_limiter l
WHERE   l.grouper = lo.grouper
AND l.id &lt;= lo.mid

id	grouper	value
32	0	Value 32
64	0	Value 64

447	31	Value 447
479	31	Value 479
1000000	32	Value 1000000
481 rows fetched in 0,0047s (3,2374s)

This query works for… hmm, 3,2374 seconds, which is hardly an improvement compared to 3,4241 seconds from our previous query.

Let's look into the plan:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					33	100.00	(null)
1	PRIMARY	l	ref	PRIMARY,ix_limiter_grouper_id	ix_limiter_grouper_id	4	lo.grouper	4495	100.00	Using where
2	DERIVED	<derived4>	ALL					33	100.00	(null)
4	DERIVED	dl	range		ix_limiter_grouper_id	4		225	100.00	Using index for group-by
3	DEPENDENT SUBQUERY	li	ref	ix_limiter_grouper_id	ix_limiter_grouper_id	4	dlo.grouper	4495	100.00	Using where; Using index

As you can see, the plan for some reason doesn't use range access method for the very outer instance on t_limiter (aliased as l in our query). Instead, it tries to scan the portions of index with a given grouper and filter out the matching id's using where.

This kills all performance benefit from using indexes, as it's exactly same thing we did in our previous, inefficient query.

Can we somehow convince MySQL to use the range access method instead?

Fortunately for us, yes.

As MySQL documentation says here:

Range checked for each record (index map: N)
MySQL found no good index to use, but found that some of indexes might be used after column values from preceding tables are known. For each row combination in the preceding tables, MySQL checks whether it is possible to use a rangeor index_merge access method to retrieve rows. This is not very fast, but is faster than performing a join with no index at all.

All we need to outsmart MySQL here is to add some uncertainty into the join conditions, so that the index will not be immediately eligible for ref access:

SELECT  l.*
FROM    (
SELECT  grouper,
COALESCE(
(
SELECT  id
FROM    t_limiter li
WHERE   li.grouper = dlo.grouper
ORDER BY
li.grouper, li.id
LIMIT 14, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS mid
FROM    (
SELECT  DISTINCT grouper
FROM    t_limiter dl
) dlo
) lo, t_limiter l
WHERE   l.grouper &gt;= lo.grouper
AND l.grouper &lt;= lo.grouper
        AND l.id &lt;= lo.mid

Note the new join conditions: l.grouper >= lo.grouper AND l.grouper <= lo.grouper. This condition of course is equivalent to l.grouper = lo.grouper, but MySQL now delays selection of access method till the query is run:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					33	100.00	(null)
1	PRIMARY	l	ALL	PRIMARY,ix_limiter_grouper_id				997544	100.00	Range checked for each record (index map: 0x3)
2	DERIVED	<derived4>	ALL					33	100.00	(null)
4	DERIVED	dl	range		ix_limiter_grouper_id	4		19	100.00	Using index for group-by
3	DEPENDENT SUBQUERY	li	ref	ix_limiter_grouper_id	ix_limiter_grouper_id	4	dlo.grouper	55419	100.00	Using where; Using index

Index map: 0x3 in the plan above is the bitmask of indices the optimizer will choose from when it comes to selecting an access method. In our case, it will choose from both indexes present in our table: PRIMARY ON id and ix_limiter_grouper_id ON (grouper, id)

Let's run the query and hope that MySQL will make the right decision:

id	grouper	value
32	0	Value 32
64	0	Value 64

447	31	Value 447
479	31	Value 479
1000000	32	Value 1000000
481 rows fetched in 0,0033s (0,0072s)

0,0072 seconds!

Wow!

Written by Quassnoi

March 6th, 2009 at 9:00 pm

Posted in MySQL

One Response to 'Advanced row sampling'

Subscribe to comments with RSS

hi
first thanks !!
it is actually pretty amazing.

now i wa wonder if you can help with a query than contains reference to other columns beside grouper and id

my columns are (grouper :car_id, id: plan_datetime_start, and the others are :: order_status, cancel_reason)

how can i get latest 15 orders ignore status=6 and cancel is not nulll

yoni

8 Aug 22 at 14:57

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

EXPLAIN EXTENDED

Advanced row sampling

One Response to 'Advanced row sampling'

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Advanced row sampling

Share this:

One Response to 'Advanced row sampling'

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow