EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Shared Plan and Algorithm Network Cache (SPANC). You're welcome to read and comment on it.

Groups holding highest ranked items

Answering questions asked on the site.

Nate asks:

I know you’ve addressed similar issues related to the greatest-per-group query but this seems to be a different take on that.

Example table:

t_group
item_id group_id score
100 1 2
100 2 3
200 1 1
300 1 4
300 2 2

Each item may be in multiple groups. Each instance of an item in that group is given a score (how relevant it is the the group).

So given the data above, when querying for group 1 it should return items 200 and 300 (item 100‘s highest score is for group 2, so it’s excluded).

The classical greatest-n-per-group problem requires selecting a single record from each group holding a group-wise maximum. This case is a little bit different: for a given group, we need to select all records holding an item-wise maximum.

Let’s create a sample table:

Table creation details

This table contains 1,000,000 records: 1,000 items in 1,000 groups with random scores.

Let’s write a query which would return us all items whose largest score is in group 1.

To do this, we need to select all items from group 1 and check that no other group has a greater value of score for that item. The most intuitive query for this would look like that:

SELECT  *
FROM    t_groups go
WHERE   group_id = 1
        AND NOT EXISTS
        (
        SELECT  group_id
        FROM    t_groups gi
        WHERE   gi.item_id = go.item_id
                AND (gi.score, gi.group_id) > (go.score, go.group_id)
        )
item_id group_id score
288 1 9997
778 1 9995
970 1 9999
3 rows fetched in 0.0002s (0.4210s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY go ref PRIMARY PRIMARY 4 const 1496 100.00 Using where
2 DEPENDENT SUBQUERY gi ref ix_groups_gsi ix_groups_gsi 4 20100422_rank.go.item_id 366 100.00 Using where; Using index
Field or reference '20100422_rank.go.item_id' of SELECT #2 was resolved in SELECT #1
Field or reference '20100422_rank.go.score' of SELECT #2 was resolved in SELECT #1
Field or reference '20100422_rank.go.group_id' of SELECT #2 was resolved in SELECT #1
select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (not(exists(select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where ((`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) and ((`20100422_rank`.`gi`.`score`,`20100422_rank`.`gi`.`group_id`) > (`20100422_rank`.`go`.`score`,`20100422_rank`.`go`.`group_id`)))))))

This query returns us 3 items and their score. We see that these items are not ranked higher in any other group.

However, this query is quite inefficient: it takes almost half a second. This is way too long for 1K items. Let’s look how can it be optimized.

If we look into the query plan we will see that no range access path is used for the subquery, despite the condition that can easily be optimized used such an access path. Instead, MySQL uses only the ref access on item_id, combined with Using where; Using index.

What does it all mean?

We have the composite index on (item_id, score, group_id). This means that within the index, the records are ordered first on item_id then by score and then, in a case of a tie, by group_id.

For group 1, MySQL should perform 1,000 comparisons: for each item within the group, the engine should make sure that no other group has a higher score for the same item.

Ideally, MySQL should have found each given set of (item_id, score, group_id) in the index and then just make a single next key search to check if this record is last in the index within the given item_id. That would show as a range search in the query plan, since we actually are checking values between (item_id, score, group_id) and (item_id, +INF, +INF).

However, MySQL cannot do such things in a correlated subquery. Instead, it uses the ref access path: takes all records with the given item_id (i. e. between (item_id, -INF, -INF) and (item_id, +INF, +INF)) and traverses them applying the WHERE filter to each record.

For group 1, only 3 items are returned. This means that for majority of items (997 items), the whole index range for the items (1,000 records per item) had to be scanned.

And since the matching record is always the last one in the index range (because it’s the one with the greatest score), this means that even for returned items, the whole ranged had to be scanned too. The only difference is that the last record in the range satisfies the WHERE condition. No wonder it takes so long.

To speed up the query we need to trick MySQL a little.

Since it’s always the last record in the index range that satisfies the WHERE condition, why don’t we just take it and compare? If we see that it holds our group, we return TRUE, if we don’t, we return FALSE right away.

Here’s how we can do this:

SELECT  *
FROM    t_groups go
WHERE   group_id = 1
        AND group_id =
        (
        SELECT  group_id
        FROM    t_groups gi
        WHERE   gi.item_id = go.item_id
        ORDER BY
                item_id DESC, score DESC, group_id DESC
        LIMIT 1
        )
item_id group_id score
288 1 9997
778 1 9995
970 1 9999
3 rows fetched in 0.0002s (0.0100s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY go ref PRIMARY PRIMARY 4 const 1496 100.00 Using where
2 DEPENDENT SUBQUERY gi ref ix_groups_gsi ix_groups_gsi 4 20100422_rank.go.item_id 366 100.00 Using where; Using index
Field or reference '20100422_rank.go.item_id' of SELECT #2 was resolved in SELECT #1
select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (1 = (select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where (`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) order by `20100422_rank`.`gi`.`item_id` desc,`20100422_rank`.`gi`.`score` desc,`20100422_rank`.`gi`.`group_id` desc limit 1)))

Instead of checking the existence of the last row in a range, we check it’s value. This means that exactly one index record will be evaluated for each item within the group.

Note that the execution plan used by this query looks exactly the same as the first query’s one. Same ref condition on item_id, same Using where; Using index. The only difference (not shown in the EXPLAIN output) it that the index is traversed in descending order now. Instead of fetching 1,000 records from the beginning to check the existence of the last one, we just take one record from the end and check its value.

Those who read my blog already know why do we use seemingly redundant item_id DESC in the ORDER BY clause here. To make MySQL to use descending index access path instead of a filesort in an ordered query, we should list all clauses that constitute the index, even if some of them are filtered by the WHERE condition.

The second query completes in only 10 ms which is 40 times as fast as the original query.

Hope that helps.


I’m always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

April 22nd, 2010 at 11:00 pm

Posted in MySQL

Comments are closed.