Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE
Comments enabled. I *really* need your comment
In the previous article I wrote about optimized emulation of the analytic functions in MySQL.
Now, let's try to optimize LAG
, LEAD
, FIRST_VALUE
and LAST_VALUE
.
Imagine we have a table that keeps actions of a PC in an online game. This table has the following design:
- Action id
- PC id
- Current PC level
- Current PC score
- Action data
For each action, the current level and current score of the PC are keeped. The table, of course, is designed in such a bad way just to illustrate our task :)
Now, for first 2 player characters, we need to know first 2 actions performed on first 2 levels; how much score did these actions yield; and how much score left to reach the next level.
With analytic functions, it would be done the following way:
SELECT id, pc, data, score_yielded, score_left FROM ( SELECT id, pc, data, LAG(score) OVER (PARTITION BY pc ORDER BY level, id) - score AS score_yielded, LAST_VALUE(score) OVER (PARTITION BY pc, level ORDER BY id) - score AS score_left, ROW_NUMBER() OVER (PARTITION BY pc, level ORDER BY id) AS rn FROM t_action WHERE pc IN (1, 2) AND level IN (1, 2) ORDER BY pc, level, id ) WHERE rn <= 2
Note that the PARTITION
and ORDER BY
clauses differ for the LAG()
. This is because we are comparing previous score to the current score to find the difference, and the previous action could be performed when the PC still was one level down. So this function is partitioned only by pc
.
How do we do it in MySQL?
- We create an index on
(pc, level, id)
. This will greatly speed up our query - In a subquery, we select those distinct values of
pc
andlevel
that satisfy our criteria. These values will be the base for the further joins - For each
(pc, level)
pair, we select the 2ndid
. We do it by usingORDER BY pc, level, id LIMIT 1, 1
in a correlated subquery in theSELECT
clause. The index created above greatly improves this subquery, as it's actually a one-row index lookup - In a same way, for each
(pc, level)
pair, we also select thescore
from the last row. Now we useORDER BY pc DESC, level DESC, id DESC LIMIT 1
. This emulatesLAST_VALUE()
. The index comes handy too - We join these results with
t_action
on a ranged condition:(pc, level) = (pc, level) AND (id < mid)
. This will select first two values for eachpc
andlevel
. This emulates using filtering condition onROW_NUMBER()
. To force MySQL to using ranged scan for this condition, we need to use trick with the join expression:a.pc <= adm.pc AND a.pc >= adm.pc
. This will make MySQL userange checking for each record
, which will choose access path on eachNESTED LOOP
iteration instead of selecting one path when preparing the query. Without this trick, MySQL would use aref
access path onpc, level
and loop through allid
values for each(pc, level)
pair - For each row of the joined
t_action
, we select thescore
of the previous row. We do it the same way: by usingORDER BY pc DESC, level DESC, id DESC LIMIT 1
we select first(pc, level, id)
combination that is less than current. Again, this is an instant one-row index lookup. This emulatesLAG()
function - Finally, we combine all the results together and perform the calculations
Here's the query:
SELECT a.*, score - COALESCE( ( SELECT score FROM t_action alag WHERE alag.pc = a.pc AND (alag.level, alag.id) < (a.level, a.id) ORDER BY pc DESC, level DESC, id DESC LIMIT 1 ), 0) AS score_yielded, last_score - score AS score_left FROM ( SELECT ad.pc, ad.level, ( SELECT id FROM t_action amid WHERE (amid.pc, amid.level) = (ad.pc, ad.level) ORDER BY pc, level, id LIMIT 1, 1 ) AS mid, ( SELECT score FROM t_action alast WHERE (alast.pc, alast.level) = (ad.pc, ad.level) ORDER BY pc DESC, level DESC, id DESC LIMIT 1 ) AS last_score FROM ( SELECT DISTINCT pc, level FROM t_action adi WHERE pc IN (1, 2) AND level IN (1, 2) ) ad ) adm, t_action a WHERE a.pc >= adm.pc AND a.pc <= adm.pc AND a.level = adm.level AND a.id <= mid
Here is the plan of the query…
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived3> | ALL | 4 | 100.00 | (null) | ||||
1 | PRIMARY | a | ALL | PRIMARY,ix_pc_level_score_id | 60123 | 100.00 | Range checked for each record (index map: 0x3) | |||
3 | DERIVED | <derived6> | ALL | 4 | 100.00 | (null) | ||||
6 | DERIVED | adi | range | ix_pc_level_score_id | ix_pc_level_score_id | 6 | 8 | 100.00 | Using where; Using index for group-by | |
5 | DEPENDENT SUBQUERY | alast | ref | ix_pc_level_score_id | ix_pc_level_score_id | 6 | ad.pc,ad.level | 509 | 100.00 | Using where |
4 | DEPENDENT SUBQUERY | amid | ref | ix_pc_level_score_id | ix_pc_level_score_id | 6 | ad.pc,ad.level | 509 | 100.00 | Using where; Using index |
2 | DEPENDENT SUBQUERY | alag | ref | ix_pc_level_score_id | ix_pc_level_score_id | 4 | 20090312_analytic.a.pc | 30061 | 100.00 | Using where |
… and the result:
id | pc | level | score | data | score_yielded | score_left | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 8761 | Action 1 | 8761 | 4967960 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 1 | 1 | 10493 | Action 2 | 1732 | 4966228 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1001 | 1 | 2 | 4977467 | Action 1001 | 746 | 5025854 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1002 | 1 | 2 | 4984376 | Action 1002 | 6909 | 5018945 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20001 | 2 | 1 | 2201 | Action 20001 | 2201 | 4932147 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20002 | 2 | 1 | 10574 | Action 20002 | 8373 | 4923774 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
21001 | 2 | 2 | 4937581 | Action 21001 | 3233 | 4955065 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
21002 | 2 | 2 | 4943672 | Action 21002 | 6091 | 4948974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0,0035s (0,5072s) |
As we can see, this query works and yields correct results, but takes almost half a second to complete.
What's the reason of such a low performance?
If we look into the plan, we will see that no dependent subqueries use range
access path. This is due to the bug in MySQL, which is unfortunately still not corrected.
The subqueries use the ref
access path with a where
condition. That means they use equality conditions on the first columns of the index (pc
and level
in case of mid
and last_score
; pc
only in case of score_yielded
), and scan all other columns to filter out the rows.
The subqueries have a LIMIT
clause, so they stop scanning as soon as they found the matching row. The matching row is second for mid
and first for last_score
, that's why the bug implies no performance loss for these queries.
But in case of score_yielded
, there are 20,000 rows for each player that need to be examined, and the row actually returned is in the very end, so all these rows are being scanned all right.
To improve the query performance, we need some workaround for this bug, and, fortunately, there is one.
We'll just wrap the subquery into a user-defined function that will retrieve the previous score
for any given pc
, level
and id
. The user-defined functions are not prone to this bug, and the plan for the subquery will use the optimal range
access path:
CREATE FUNCTION lag_score_partition_by_pc_order_by_level_id (_pc INT, _level SMALLINT, _id INT) RETURNS int(11) READS SQL DATA BEGIN DECLARE _score INT; DECLARE CONTINUE HANDLER FOR NOT FOUND SET _score = NULL; SELECT score INTO _score FROM t_action WHERE pc = _pc AND (level < _level OR (level = _level AND id < _id)) ORDER BY pc DESC, level DESC, id DESC LIMIT 1; RETURN _score; END
SELECT a.*, score - COALESCE(lag_score_partition_by_pc_order_by_level_id(a.pc, a.level, a.id), 0) AS score_yielded, last_score - score AS score_left FROM ( SELECT ad.pc, ad.level, ( SELECT id FROM t_action amid WHERE (amid.pc, amid.level) = (ad.pc, ad.level) ORDER BY pc, level, id LIMIT 1, 1 ) AS mid, ( SELECT score FROM t_action alast WHERE (alast.pc, alast.level) = (ad.pc, ad.level) ORDER BY pc DESC, level DESC, id DESC LIMIT 1 ) AS last_score FROM ( SELECT DISTINCT pc, level FROM t_action adi WHERE pc IN (1, 2) AND level IN (1, 2) ) ad ) adm, t_action a WHERE a.pc >= adm.pc AND a.pc <= adm.pc AND a.level = adm.level AND a.id <= mid
id | pc | level | score | data | score_yielded | score_left | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 8761 | Action 1 | 8761 | 4967960 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 1 | 1 | 10493 | Action 2 | 1732 | 4966228 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1001 | 1 | 2 | 4977467 | Action 1001 | 746 | 5025854 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1002 | 1 | 2 | 4984376 | Action 1002 | 6909 | 5018945 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20001 | 2 | 1 | 2201 | Action 20001 | 2201 | 4932147 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20002 | 2 | 1 | 10574 | Action 20002 | 8373 | 4923774 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
21001 | 2 | 2 | 4937581 | Action 21001 | 3233 | 4955065 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
21002 | 2 | 2 | 4943672 | Action 21002 | 6091 | 4948974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0,0040s (0,0030s) |
As we expected, the query execution time is greatly reduced and the query is almost instant now.