Groups holding highest ranked items
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Nate asks:
I know you've addressed similar issues related to the
greatest-per-groupquery but this seems to be a different take on that.Example table:
t_group item_id group_id score 100 1 2 100 2 3 200 1 1 300 1 4 300 2 2 Each item may be in multiple groups. Each instance of an item in that group is given a score (how relevant it is the the group).
So given the data above, when querying for group 1 it should return items 200 and 300 (item 100's highest score is for group 2, so it's excluded).
The classical greatest-n-per-group
problem requires selecting a single record from each group holding a group-wise maximum. This case is a little bit different: for a given group, we need to select all records holding an item-wise maximum.
Let's create a sample table:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_groups ( item_id INT NOT NULL, group_id INT NOT NULL, score INT NOT NULL, PRIMARY KEY (group_id, item_id), KEY ix_groups_gsi (item_id, score, group_id) ) ENGINE=InnoDB; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_groups SELECT (id - 1) % 1000 + 1, (id - 1) div 1000 + 1, CEILING(RAND(20100422) * 10000) FROM filler; [/sourcecode] </div> This table contains <strong>1,000,000</strong> records: <strong>1,000</strong> items in <strong>1,000</strong> groups with random scores. Let's write a query which would return us all items whose largest score is in group <strong>1</strong>. To do this, we need to select all items from group <strong>1</strong> and check that no other group has a greater value of <code>score</code> for that item. The most intuitive query for this would look like that: SELECT * FROM t_groups go WHERE group_id = 1 AND NOT EXISTS ( SELECT group_id FROM t_groups gi WHERE gi.item_id = go.item_id AND (gi.score, gi.group_id) > (go.score, go.group_id) )
item_id | group_id | score | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
288 | 1 | 9997 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
778 | 1 | 9995 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
970 | 1 | 9999 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0002s (0.4210s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | go | ref | PRIMARY | PRIMARY | 4 | const | 1496 | 100.00 | Using where |
2 | DEPENDENT SUBQUERY | gi | ref | ix_groups_gsi | ix_groups_gsi | 4 | 20100422_rank.go.item_id | 366 | 100.00 | Using where; Using index |
Field or reference '20100422_rank.go.item_id' of SELECT #2 was resolved in SELECT #1 Field or reference '20100422_rank.go.score' of SELECT #2 was resolved in SELECT #1 Field or reference '20100422_rank.go.group_id' of SELECT #2 was resolved in SELECT #1 select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (not(exists(select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where ((`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) and ((`20100422_rank`.`gi`.`score`,`20100422_rank`.`gi`.`group_id`) > (`20100422_rank`.`go`.`score`,`20100422_rank`.`go`.`group_id`)))))))
This query returns us 3 items and their score. We see that these items are not ranked higher in any other group.
However, this query is quite inefficient: it takes almost half a second. This is way too long for 1K items. Let's look how can it be optimized.
If we look into the query plan we will see that no range
access path is used for the subquery, despite the condition that can easily be optimized used such an access path. Instead, MySQL uses only the ref
access on item_id
, combined with Using where; Using index
.
What does it all mean?
We have the composite index on (item_id, score, group_id)
. This means that within the index, the records are ordered first on item_id
then by score
and then, in a case of a tie, by group_id
.
For group 1, MySQL should perform 1,000 comparisons: for each item within the group, the engine should make sure that no other group has a higher score for the same item.
Ideally, MySQL should have found each given set of (item_id, score, group_id)
in the index and then just make a single next key search to check if this record is last in the index within the given item_id
. That would show as a range
search in the query plan, since we actually are checking values between (item_id, score, group_id)
and (item_id, +INF, +INF)
.
However, MySQL cannot do such things in a correlated subquery. Instead, it uses the ref
access path: takes all records with the given item_id
(i. e. between (item_id, -INF, -INF)
and (item_id, +INF, +INF)
) and traverses them applying the WHERE
filter to each record.
For group 1, only 3 items are returned. This means that for majority of items (997 items), the whole index range for the items (1,000 records per item) had to be scanned.
And since the matching record is always the last one in the index range (because it's the one with the greatest score), this means that even for returned items, the whole ranged had to be scanned too. The only difference is that the last record in the range satisfies the WHERE
condition. No wonder it takes so long.
To speed up the query we need to trick MySQL a little.
Since it's always the last record in the index range that satisfies the WHERE
condition, why don't we just take it and compare? If we see that it holds our group, we return TRUE
, if we don't, we return FALSE
right away.
Here's how we can do this:
SELECT * FROM t_groups go WHERE group_id = 1 AND group_id = ( SELECT group_id FROM t_groups gi WHERE gi.item_id = go.item_id ORDER BY item_id DESC, score DESC, group_id DESC LIMIT 1 )
item_id | group_id | score | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
288 | 1 | 9997 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
778 | 1 | 9995 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
970 | 1 | 9999 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0002s (0.0100s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | go | ref | PRIMARY | PRIMARY | 4 | const | 1496 | 100.00 | Using where |
2 | DEPENDENT SUBQUERY | gi | ref | ix_groups_gsi | ix_groups_gsi | 4 | 20100422_rank.go.item_id | 366 | 100.00 | Using where; Using index |
Field or reference '20100422_rank.go.item_id' of SELECT #2 was resolved in SELECT #1 select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (1 = (select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where (`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) order by `20100422_rank`.`gi`.`item_id` desc,`20100422_rank`.`gi`.`score` desc,`20100422_rank`.`gi`.`group_id` desc limit 1)))
Instead of checking the existence of the last row in a range, we check it's value. This means that exactly one index record will be evaluated for each item within the group.
Note that the execution plan used by this query looks exactly the same as the first query's one. Same ref
condition on item_id
, same Using where; Using index
. The only difference (not shown in the EXPLAIN
output) it that the index is traversed in descending order now. Instead of fetching 1,000 records from the beginning to check the existence of the last one, we just take one record from the end and check its value.
Those who read my blog already know why do we use seemingly redundant item_id DESC
in the ORDER BY
clause here. To make MySQL to use descending index access path instead of a filesort in an ordered query, we should list all clauses that constitute the index, even if some of them are filtered by the WHERE
condition.
The second query completes in only 10 ms which is 40 times as fast as the original query.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Written by Quassnoi
April 22nd, 2010 at 11:00 pm
Posted in MySQL
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009