EXPLAIN EXTENDED

How to create fast database queries

Advanced row sampling

with one comment

Yesterday I wrote an article on how to emulate analytiс function ROW_NUMBER() that is present in SQL Server and Oracle, but absent in MySQL.

Today, we will try to optimize this query.

The original query scans all table, filtering out rows when necessary. This, as has already been said, if efficient if the number of rows is comparable to the number of grouper's.

But in real tables, the number of grouper's tends to be much less than the number of rows, leaving us some place for optimization.

The main drawback of the query above is that it always scans the whole table, while in fact it needs just a few items for each grouper.

What the query really needs to do, is:

  1. Select all grouper's
  2. For each grouper, select 15 first rows, that is rows with minimal id's
  3. Return the rows to the client

This could be easily achieved in any procedural language using simple conditional loops. But in MySQL, we cannot just loop a table in an SQL query with a counter. We need to supply the join conditions first. And, unfortunately for us, the NESTED LOOP count cannot be used as a join condition. We don't even have access to this variable from within the query.

We, though, may try to outsmart the database and fetch the join conditions we need.

What we really need here, is the 1st and the 15th id's for each grouper.

As we try to select rows sorted by id for each grouper here, we really need all rows that satisfy this condition: (grouper, id[grouper][1]) <= (row.grouper, row.id) <= (row.grouper, id[grouper][15])

15th id canbe easily selected as:

SELECT  id
FROM    t_limiter li
WHERE   li.grouper = @grouper
ORDER BY
li.grouper, li.id
LIMIT 14, 1

This will work instantly, as it will utilize the index on (grouper, id) which we created along with the table.

Now, what if we don't have the 15's value in the table? Like in sample table, for which I deliberatly created only 1 record for grouper 32?

In this case the query will return no rows. Empty recordset, beign returned as result a correlated subquery, is treated like a NULL, which we can easily coalesce to MAXVALUE (that is, 2^32 - 1).

Summarizing all said above, we will write the following query:

SELECT  l.*
FROM    (
SELECT  grouper,
COALESCE(
(
SELECT  id
FROM    t_limiter li
WHERE   li.grouper = dlo.grouper
ORDER BY
li.grouper, li.id
LIMIT 14, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS mid
FROM    (
SELECT  DISTINCT grouper
FROM    t_limiter dl
) dlo
) lo, t_limiter l
WHERE   l.grouper = lo.grouper
AND l.id &lt;= lo.mid
id grouper value
32 0 Value 32
64 0 Value 64
447 31 Value 447
479 31 Value 479
1000000 32 Value 1000000
481 rows fetched in 0,0047s (3,2374s)

This query works for… hmm, 3,2374 seconds, which is hardly an improvement compared to 3,4241 seconds from our previous query.

Let's look into the plan:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 33 100.00 (null)
1 PRIMARY l ref PRIMARY,ix_limiter_grouper_id ix_limiter_grouper_id 4 lo.grouper 4495 100.00 Using where
2 DERIVED <derived4> ALL 33 100.00 (null)
4 DERIVED dl range ix_limiter_grouper_id 4 225 100.00 Using index for group-by
3 DEPENDENT SUBQUERY li ref ix_limiter_grouper_id ix_limiter_grouper_id 4 dlo.grouper 4495 100.00 Using where; Using index

As you can see, the plan for some reason doesn't use range access method for the very outer instance on t_limiter (aliased as l in our query). Instead, it tries to scan the portions of index with a given grouper and filter out the matching id's using where.

This kills all performance benefit from using indexes, as it's exactly same thing we did in our previous, inefficient query.

Can we somehow convince MySQL to use the range access method instead?

Fortunately for us, yes.

As MySQL documentation says here:

Range checked for each record (index map: N)
MySQL found no good index to use, but found that some of indexes might be used after column values from preceding tables are known. For each row combination in the preceding tables, MySQL checks whether it is possible to use a rangeor index_merge access method to retrieve rows. This is not very fast, but is faster than performing a join with no index at all.

All we need to outsmart MySQL here is to add some uncertainty into the join conditions, so that the index will not be immediately eligible for ref access:

SELECT  l.*
FROM    (
SELECT  grouper,
COALESCE(
(
SELECT  id
FROM    t_limiter li
WHERE   li.grouper = dlo.grouper
ORDER BY
li.grouper, li.id
LIMIT 14, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS mid
FROM    (
SELECT  DISTINCT grouper
FROM    t_limiter dl
) dlo
) lo, t_limiter l
WHERE   l.grouper &gt;= lo.grouper
AND l.grouper &lt;= lo.grouper
        AND l.id &lt;= lo.mid

Note the new join conditions: l.grouper >= lo.grouper AND l.grouper <= lo.grouper. This condition of course is equivalent to l.grouper = lo.grouper, but MySQL now delays selection of access method till the query is run:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 33 100.00 (null)
1 PRIMARY l ALL PRIMARY,ix_limiter_grouper_id 997544 100.00 Range checked for each record (index map: 0x3)
2 DERIVED <derived4> ALL 33 100.00 (null)
4 DERIVED dl range ix_limiter_grouper_id 4 19 100.00 Using index for group-by
3 DEPENDENT SUBQUERY li ref ix_limiter_grouper_id ix_limiter_grouper_id 4 dlo.grouper 55419 100.00 Using where; Using index

Index map: 0x3 in the plan above is the bitmask of indices the optimizer will choose from when it comes to selecting an access method. In our case, it will choose from both indexes present in our table: PRIMARY ON id and ix_limiter_grouper_id ON (grouper, id)

Let's run the query and hope that MySQL will make the right decision:

id grouper value
32 0 Value 32
64 0 Value 64
447 31 Value 447
479 31 Value 479
1000000 32 Value 1000000
481 rows fetched in 0,0033s (0,0072s)

0,0072 seconds!

Wow!

Written by Quassnoi

March 6th, 2009 at 9:00 pm

Posted in MySQL

One Response to 'Advanced row sampling'

Subscribe to comments with RSS

  1. hi
    first thanks !!
    it is actually pretty amazing.

    now i wa wonder if you can help with a query than contains reference to other columns beside grouper and id

    my columns are (grouper :car_id, id: plan_datetime_start, and the others are :: order_status, cancel_reason)

    how can i get latest 15 orders ignore status=6 and cancel is not nulll

    yoni

    8 Aug 22 at 14:57

Leave a Reply