Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE
Comments enabled. I *really* need your comment
In the previous article I wrote about optimized emulation of the analytic functions in MySQL.
Now, let's try to optimize LAG, LEAD, FIRST_VALUE and LAST_VALUE.
Imagine we have a table that keeps actions of a PC in an online game. This table has the following design:
- Action id
- PC id
- Current PC level
- Current PC score
- Action data
For each action, the current level and current score of the PC are keeped. The table, of course, is designed in such a bad way just to illustrate our task :)
Now, for first 2 player characters, we need to know first 2 actions performed on first 2 levels; how much score did these actions yield; and how much score left to reach the next level.
With analytic functions, it would be done the following way:
SELECT id, pc, data, score_yielded, score_left FROM ( SELECT id, pc, data, LAG(score) OVER (PARTITION BY pc ORDER BY level, id) - score AS score_yielded, LAST_VALUE(score) OVER (PARTITION BY pc, level ORDER BY id) - score AS score_left, ROW_NUMBER() OVER (PARTITION BY pc, level ORDER BY id) AS rn FROM t_action WHERE pc IN (1, 2) AND level IN (1, 2) ORDER BY pc, level, id ) WHERE rn <= 2
Note that the PARTITION and ORDER BY clauses differ for the LAG(). This is because we are comparing previous score to the current score to find the difference, and the previous action could be performed when the PC still was one level down. So this function is partitioned only by pc.
How do we do it in MySQL?
- We create an index on
(pc, level, id). This will greatly speed up our query - In a subquery, we select those distinct values of
pcandlevelthat satisfy our criteria. These values will be the base for the further joins - For each
(pc, level)pair, we select the 2ndid. We do it by usingORDER BY pc, level, id LIMIT 1, 1in a correlated subquery in theSELECTclause. The index created above greatly improves this subquery, as it's actually a one-row index lookup - In a same way, for each
(pc, level)pair, we also select thescorefrom the last row. Now we useORDER BY pc DESC, level DESC, id DESC LIMIT 1. This emulatesLAST_VALUE(). The index comes handy too - We join these results with
t_actionon a ranged condition:(pc, level) = (pc, level) AND (id < mid). This will select first two values for eachpcandlevel. This emulates using filtering condition onROW_NUMBER(). To force MySQL to using ranged scan for this condition, we need to use trick with the join expression:a.pc <= adm.pc AND a.pc >= adm.pc. This will make MySQL userange checking for each record, which will choose access path on eachNESTED LOOPiteration instead of selecting one path when preparing the query. Without this trick, MySQL would use arefaccess path onpc, leveland loop through allidvalues for each(pc, level)pair - For each row of the joined
t_action, we select thescoreof the previous row. We do it the same way: by usingORDER BY pc DESC, level DESC, id DESC LIMIT 1we select first(pc, level, id)combination that is less than current. Again, this is an instant one-row index lookup. This emulatesLAG()function - Finally, we combine all the results together and perform the calculations
Here's the query:
SELECT a.*,
score - COALESCE(
(
SELECT score
FROM t_action alag
WHERE alag.pc = a.pc
AND (alag.level, alag.id) < (a.level, a.id)
ORDER BY
pc DESC, level DESC, id DESC
LIMIT 1
), 0) AS score_yielded,
last_score - score AS score_left
FROM (
SELECT ad.pc, ad.level,
(
SELECT id
FROM t_action amid
WHERE (amid.pc, amid.level) = (ad.pc, ad.level)
ORDER BY
pc, level, id
LIMIT 1, 1
) AS mid,
(
SELECT score
FROM t_action alast
WHERE (alast.pc, alast.level) = (ad.pc, ad.level)
ORDER BY
pc DESC, level DESC, id DESC
LIMIT 1
) AS last_score
FROM (
SELECT DISTINCT pc, level
FROM t_action adi
WHERE pc IN (1, 2)
AND level IN (1, 2)
) ad
) adm,
t_action a
WHERE a.pc >= adm.pc
AND a.pc <= adm.pc
AND a.level = adm.level
AND a.id <= mid
Here is the plan of the query…
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived3> | ALL | 4 | 100.00 | (null) | ||||
| 1 | PRIMARY | a | ALL | PRIMARY,ix_pc_level_score_id | 60123 | 100.00 | Range checked for each record (index map: 0x3) | |||
| 3 | DERIVED | <derived6> | ALL | 4 | 100.00 | (null) | ||||
| 6 | DERIVED | adi | range | ix_pc_level_score_id | ix_pc_level_score_id | 6 | 8 | 100.00 | Using where; Using index for group-by | |
| 5 | DEPENDENT SUBQUERY | alast | ref | ix_pc_level_score_id | ix_pc_level_score_id | 6 | ad.pc,ad.level | 509 | 100.00 | Using where |
| 4 | DEPENDENT SUBQUERY | amid | ref | ix_pc_level_score_id | ix_pc_level_score_id | 6 | ad.pc,ad.level | 509 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | alag | ref | ix_pc_level_score_id | ix_pc_level_score_id | 4 | 20090312_analytic.a.pc | 30061 | 100.00 | Using where |
… and the result:
| id | pc | level | score | data | score_yielded | score_left | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 8761 | Action 1 | 8761 | 4967960 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | 1 | 1 | 10493 | Action 2 | 1732 | 4966228 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1001 | 1 | 2 | 4977467 | Action 1001 | 746 | 5025854 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1002 | 1 | 2 | 4984376 | Action 1002 | 6909 | 5018945 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20001 | 2 | 1 | 2201 | Action 20001 | 2201 | 4932147 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20002 | 2 | 1 | 10574 | Action 20002 | 8373 | 4923774 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21001 | 2 | 2 | 4937581 | Action 21001 | 3233 | 4955065 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21002 | 2 | 2 | 4943672 | Action 21002 | 6091 | 4948974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 rows fetched in 0,0035s (0,5072s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As we can see, this query works and yields correct results, but takes almost half a second to complete.
What's the reason of such a low performance?
If we look into the plan, we will see that no dependent subqueries use range access path. This is due to the bug in MySQL, which is unfortunately still not corrected.
The subqueries use the ref access path with a where condition. That means they use equality conditions on the first columns of the index (pc and level in case of mid and last_score; pc only in case of score_yielded), and scan all other columns to filter out the rows.
The subqueries have a LIMIT clause, so they stop scanning as soon as they found the matching row. The matching row is second for mid and first for last_score, that's why the bug implies no performance loss for these queries.
But in case of score_yielded, there are 20,000 rows for each player that need to be examined, and the row actually returned is in the very end, so all these rows are being scanned all right.
To improve the query performance, we need some workaround for this bug, and, fortunately, there is one.
We'll just wrap the subquery into a user-defined function that will retrieve the previous score for any given pc, level and id. The user-defined functions are not prone to this bug, and the plan for the subquery will use the optimal range access path:
CREATE FUNCTION lag_score_partition_by_pc_order_by_level_id (_pc INT, _level SMALLINT, _id INT) RETURNS int(11)
READS SQL DATA
BEGIN
DECLARE _score INT;
DECLARE CONTINUE HANDLER FOR NOT FOUND
SET _score = NULL;
SELECT score
INTO _score
FROM t_action
WHERE pc = _pc
AND (level < _level OR (level = _level AND id < _id))
ORDER BY
pc DESC, level DESC, id DESC
LIMIT 1;
RETURN _score;
END
SELECT a.*,
score - COALESCE(lag_score_partition_by_pc_order_by_level_id(a.pc, a.level, a.id), 0) AS score_yielded,
last_score - score AS score_left
FROM (
SELECT ad.pc, ad.level,
(
SELECT id
FROM t_action amid
WHERE (amid.pc, amid.level) = (ad.pc, ad.level)
ORDER BY
pc, level, id
LIMIT 1, 1
) AS mid,
(
SELECT score
FROM t_action alast
WHERE (alast.pc, alast.level) = (ad.pc, ad.level)
ORDER BY
pc DESC, level DESC, id DESC
LIMIT 1
) AS last_score
FROM (
SELECT DISTINCT pc, level
FROM t_action adi
WHERE pc IN (1, 2)
AND level IN (1, 2)
) ad
) adm,
t_action a
WHERE a.pc >= adm.pc
AND a.pc <= adm.pc
AND a.level = adm.level
AND a.id <= mid
| id | pc | level | score | data | score_yielded | score_left | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 8761 | Action 1 | 8761 | 4967960 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | 1 | 1 | 10493 | Action 2 | 1732 | 4966228 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1001 | 1 | 2 | 4977467 | Action 1001 | 746 | 5025854 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1002 | 1 | 2 | 4984376 | Action 1002 | 6909 | 5018945 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20001 | 2 | 1 | 2201 | Action 20001 | 2201 | 4932147 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20002 | 2 | 1 | 10574 | Action 20002 | 8373 | 4923774 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21001 | 2 | 2 | 4937581 | Action 21001 | 3233 | 4955065 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21002 | 2 | 2 | 4943672 | Action 21002 | 6091 | 4948974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 rows fetched in 0,0040s (0,0030s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As we expected, the query execution time is greatly reduced and the query is almost instant now.
Subscribe in a reader