EXPLAIN EXTENDED

How to create fast database queries

Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE

Comments enabled. I *really* need your comment

In the previous article I wrote about optimized emulation of the analytic functions in MySQL.

Now, let's try to optimize LAG, LEAD, FIRST_VALUE and LAST_VALUE.

Imagine we have a table that keeps actions of a PC in an online game. This table has the following design:

  1. Action id
  2. PC id
  3. Current PC level
  4. Current PC score
  5. Action data

For each action, the current level and current score of the PC are keeped. The table, of course, is designed in such a bad way just to illustrate our task :)

Now, for first 2 player characters, we need to know first 2 actions performed on first 2 levels; how much score did these actions yield; and how much score left to reach the next level.

With analytic functions, it would be done the following way:

SELECT  id, pc, data, score_yielded, score_left
FROM    (
SELECT  id, pc, data,
LAG(score) OVER (PARTITION BY pc ORDER BY level, id) - score AS score_yielded,
LAST_VALUE(score) OVER (PARTITION BY pc, level ORDER BY id) - score AS score_left,
ROW_NUMBER() OVER (PARTITION BY pc, level ORDER BY id) AS rn
FROM    t_action
WHERE   pc IN (1, 2)
AND level IN (1, 2)
ORDER BY
pc, level, id
)
WHERE       rn <= 2

Note that the PARTITION and ORDER BY clauses differ for the LAG(). This is because we are comparing previous score to the current score to find the difference, and the previous action could be performed when the PC still was one level down. So this function is partitioned only by pc.

How do we do it in MySQL?

  1. We create an index on (pc, level, id). This will greatly speed up our query
  2. In a subquery, we select those distinct values of pc and level that satisfy our criteria. These values will be the base for the further joins
  3. For each (pc, level) pair, we select the 2nd id. We do it by using ORDER BY pc, level, id LIMIT 1, 1 in a correlated subquery in the SELECT clause. The index created above greatly improves this subquery, as it's actually a one-row index lookup
  4. In a same way, for each (pc, level) pair, we also select the score from the last row. Now we use ORDER BY pc DESC, level DESC, id DESC LIMIT 1. This emulates LAST_VALUE(). The index comes handy too
  5. We join these results with t_action on a ranged condition: (pc, level) = (pc, level) AND (id < mid). This will select first two values for each pc and level. This emulates using filtering condition on ROW_NUMBER(). To force MySQL to using ranged scan for this condition, we need to use trick with the join expression: a.pc <= adm.pc AND a.pc >= adm.pc. This will make MySQL use range checking for each record, which will choose access path on each NESTED LOOP iteration instead of selecting one path when preparing the query. Without this trick, MySQL would use a ref access path on pc, level and loop through all id values for each (pc, level) pair
  6. For each row of the joined t_action, we select the score of the previous row. We do it the same way: by using ORDER BY pc DESC, level DESC, id DESC LIMIT 1 we select first (pc, level, id) combination that is less than current. Again, this is an instant one-row index lookup. This emulates LAG() function
  7. Finally, we combine all the results together and perform the calculations

Here's the query:

SELECT  a.*,
score - COALESCE(
(
SELECT  score
FROM    t_action alag
WHERE   alag.pc = a.pc
AND (alag.level, alag.id) &lt; (a.level, a.id)
        ORDER BY
                pc DESC, level DESC, id DESC
        LIMIT 1
        ), 0) AS score_yielded,
        last_score - score AS score_left
FROM    (
        SELECT  ad.pc, ad.level,
                (
                SELECT  id
                FROM    t_action amid
                WHERE   (amid.pc, amid.level) = (ad.pc, ad.level)
                ORDER BY
                        pc, level, id
                LIMIT   1, 1
                ) AS mid,
                (
                SELECT  score
                FROM    t_action alast
                WHERE   (alast.pc, alast.level) = (ad.pc, ad.level)
                ORDER BY
                        pc DESC, level DESC, id DESC
                LIMIT   1
                ) AS last_score
        FROM    (
                SELECT DISTINCT pc, level
                FROM   t_action adi
                WHERE  pc IN (1, 2)
                       AND level IN (1, 2)
                ) ad
        ) adm,
        t_action a
WHERE   a.pc &gt;= adm.pc
AND a.pc &lt;= adm.pc
        AND a.level = adm.level
        AND a.id &lt;= mid

Here is the plan of the query…

id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived3> ALL 4 100.00 (null)
1 PRIMARY a ALL PRIMARY,ix_pc_level_score_id 60123 100.00 Range checked for each record (index map: 0x3)
3 DERIVED <derived6> ALL 4 100.00 (null)
6 DERIVED adi range ix_pc_level_score_id ix_pc_level_score_id 6 8 100.00 Using where; Using index for group-by
5 DEPENDENT SUBQUERY alast ref ix_pc_level_score_id ix_pc_level_score_id 6 ad.pc,ad.level 509 100.00 Using where
4 DEPENDENT SUBQUERY amid ref ix_pc_level_score_id ix_pc_level_score_id 6 ad.pc,ad.level 509 100.00 Using where; Using index
2 DEPENDENT SUBQUERY alag ref ix_pc_level_score_id ix_pc_level_score_id 4 20090312_analytic.a.pc 30061 100.00 Using where

… and the result:

id pc level score data score_yielded score_left
1 1 1 8761 Action 1 8761 4967960
2 1 1 10493 Action 2 1732 4966228
1001 1 2 4977467 Action 1001 746 5025854
1002 1 2 4984376 Action 1002 6909 5018945
20001 2 1 2201 Action 20001 2201 4932147
20002 2 1 10574 Action 20002 8373 4923774
21001 2 2 4937581 Action 21001 3233 4955065
21002 2 2 4943672 Action 21002 6091 4948974
8 rows fetched in 0,0035s (0,5072s)

As we can see, this query works and yields correct results, but takes almost half a second to complete.

What's the reason of such a low performance?

If we look into the plan, we will see that no dependent subqueries use range access path. This is due to the bug in MySQL, which is unfortunately still not corrected.

The subqueries use the ref access path with a where condition. That means they use equality conditions on the first columns of the index (pc and level in case of mid and last_score; pc only in case of score_yielded), and scan all other columns to filter out the rows.

The subqueries have a LIMIT clause, so they stop scanning as soon as they found the matching row. The matching row is second for mid and first for last_score, that's why the bug implies no performance loss for these queries.

But in case of score_yielded, there are 20,000 rows for each player that need to be examined, and the row actually returned is in the very end, so all these rows are being scanned all right.

To improve the query performance, we need some workaround for this bug, and, fortunately, there is one.

We'll just wrap the subquery into a user-defined function that will retrieve the previous score for any given pc, level and id. The user-defined functions are not prone to this bug, and the plan for the subquery will use the optimal range access path:

CREATE FUNCTION lag_score_partition_by_pc_order_by_level_id (_pc INT, _level SMALLINT, _id INT) RETURNS int(11)
READS SQL DATA
BEGIN
DECLARE _score INT;
DECLARE CONTINUE HANDLER FOR NOT FOUND
SET _score = NULL;
SELECT  score
INTO    _score
FROM    t_action
WHERE   pc = _pc
AND (level &lt; _level OR (level = _level AND id &lt; _id))
        ORDER BY
                pc DESC, level DESC, id DESC
        LIMIT 1;
        RETURN _score;
END
SELECT  a.*,
        score - COALESCE(lag_score_partition_by_pc_order_by_level_id(a.pc, a.level, a.id), 0) AS score_yielded,
        last_score - score AS score_left
FROM    (
        SELECT  ad.pc, ad.level,
                (
                SELECT  id
                FROM    t_action amid
                WHERE   (amid.pc, amid.level) = (ad.pc, ad.level)
                ORDER BY
                        pc, level, id
                LIMIT   1, 1
                ) AS mid,
                (
                SELECT  score
                FROM    t_action alast
                WHERE   (alast.pc, alast.level) = (ad.pc, ad.level)
                ORDER BY
                        pc DESC, level DESC, id DESC
                LIMIT   1
                ) AS last_score
        FROM    (
                SELECT DISTINCT pc, level
                FROM   t_action adi
                WHERE  pc IN (1, 2)
                       AND level IN (1, 2)
                ) ad
        ) adm,
        t_action a
WHERE   a.pc &gt;= adm.pc
AND a.pc &lt;= adm.pc
        AND a.level = adm.level
        AND a.id &lt;= mid
id pc level score data score_yielded score_left
1 1 1 8761 Action 1 8761 4967960
2 1 1 10493 Action 2 1732 4966228
1001 1 2 4977467 Action 1001 746 5025854
1002 1 2 4984376 Action 1002 6909 5018945
20001 2 1 2201 Action 20001 2201 4932147
20002 2 1 10574 Action 20002 8373 4923774
21001 2 2 4937581 Action 21001 3233 4955065
21002 2 2 4943672 Action 21002 6091 4948974
8 rows fetched in 0,0040s (0,0030s)

As we expected, the query execution time is greatly reduced and the query is almost instant now.

Written by Quassnoi

March 12th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply