Groups holding highest ranked items at EXPLAIN EXTENDED

Groups holding highest ranked items

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Nate asks:

I know you've addressed similar issues related to the greatest-per-group query but this seems to be a different take on that.

Example table:

t_group
item_id group_id score

100 1 2

100 2 3

200 1 1

300 1 4

300 2 2

Each item may be in multiple groups. Each instance of an item in that group is given a score (how relevant it is the the group).

So given the data above, when querying for group 1 it should return items 200 and 300 (item 100's highest score is for group 2, so it's excluded).

t_group
item_id	group_id	score
100	1	2
100	2	3
200	1	1
300	1	4
300	2	2

The classical greatest-n-per-group problem requires selecting a single record from each group holding a group-wise maximum. This case is a little bit different: for a given group, we need to select all records holding an item-wise maximum.

Let's create a sample table:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_groups (
        item_id INT NOT NULL,
        group_id INT NOT NULL,
        score INT NOT NULL,
        PRIMARY KEY (group_id, item_id),
        KEY ix_groups_gsi (item_id, score, group_id)
) ENGINE=InnoDB;              

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_groups
SELECT  (id - 1) % 1000 + 1,
        (id - 1) div 1000 + 1,
        CEILING(RAND(20100422) * 10000)
FROM    filler;
&#91;/sourcecode&#93;
</div>

This table contains <strong>1,000,000</strong> records: <strong>1,000</strong> items in <strong>1,000</strong> groups with random scores.

Let's write a query which would return us all items whose largest score is in group <strong>1</strong>.

To do this, we need to select all items from group <strong>1</strong> and check that no other group has a greater value of <code>score</code> for that item. The most intuitive query for this would look like that:


SELECT  *
FROM    t_groups go
WHERE   group_id = 1
        AND NOT EXISTS
        (
        SELECT  group_id
        FROM    t_groups gi
        WHERE   gi.item_id = go.item_id
                AND (gi.score, gi.group_id) > (go.score, go.group_id)
        )

item_id	group_id	score
288	1	9997
778	1	9995
970	1	9999
3 rows fetched in 0.0002s (0.4210s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	go	ref	PRIMARY	PRIMARY	4	const	1496	100.00	Using where
2	DEPENDENT SUBQUERY	gi	ref	ix_groups_gsi	ix_groups_gsi	4	20100422_rank.go.item_id	366	100.00	Using where; Using index

Field or reference '20100422_rank.go.item_id' of SELECT #2 was resolved in SELECT #1
Field or reference '20100422_rank.go.score' of SELECT #2 was resolved in SELECT #1
Field or reference '20100422_rank.go.group_id' of SELECT #2 was resolved in SELECT #1
select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (not(exists(select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where ((`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) and ((`20100422_rank`.`gi`.`score`,`20100422_rank`.`gi`.`group_id`) > (`20100422_rank`.`go`.`score`,`20100422_rank`.`go`.`group_id`)))))))

This query returns us 3 items and their score. We see that these items are not ranked higher in any other group.

However, this query is quite inefficient: it takes almost half a second. This is way too long for 1K items. Let's look how can it be optimized.

If we look into the query plan we will see that no range access path is used for the subquery, despite the condition that can easily be optimized used such an access path. Instead, MySQL uses only the ref access on item_id, combined with Using where; Using index.

What does it all mean?

We have the composite index on (item_id, score, group_id). This means that within the index, the records are ordered first on item_id then by score and then, in a case of a tie, by group_id.

For group 1, MySQL should perform 1,000 comparisons: for each item within the group, the engine should make sure that no other group has a higher score for the same item.

Ideally, MySQL should have found each given set of (item_id, score, group_id) in the index and then just make a single next key search to check if this record is last in the index within the given item_id. That would show as a range search in the query plan, since we actually are checking values between (item_id, score, group_id) and (item_id, +INF, +INF).

However, MySQL cannot do such things in a correlated subquery. Instead, it uses the ref access path: takes all records with the given item_id (i. e. between (item_id, -INF, -INF) and (item_id, +INF, +INF)) and traverses them applying the WHERE filter to each record.

For group 1, only 3 items are returned. This means that for majority of items (997 items), the whole index range for the items (1,000 records per item) had to be scanned.

And since the matching record is always the last one in the index range (because it's the one with the greatest score), this means that even for returned items, the whole ranged had to be scanned too. The only difference is that the last record in the range satisfies the WHERE condition. No wonder it takes so long.

To speed up the query we need to trick MySQL a little.

Since it's always the last record in the index range that satisfies the WHERE condition, why don't we just take it and compare? If we see that it holds our group, we return TRUE, if we don't, we return FALSE right away.

Here's how we can do this:

SELECT  *
FROM    t_groups go
WHERE   group_id = 1
        AND group_id =
        (
        SELECT  group_id
        FROM    t_groups gi
        WHERE   gi.item_id = go.item_id
        ORDER BY
                item_id DESC, score DESC, group_id DESC
        LIMIT 1
        )

item_id	group_id	score
288	1	9997
778	1	9995
970	1	9999
3 rows fetched in 0.0002s (0.0100s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	go	ref	PRIMARY	PRIMARY	4	const	1496	100.00	Using where
2	DEPENDENT SUBQUERY	gi	ref	ix_groups_gsi	ix_groups_gsi	4	20100422_rank.go.item_id	366	100.00	Using where; Using index

Field or reference '20100422_rank.go.item_id' of SELECT #2 was resolved in SELECT #1
select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (1 = (select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where (`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) order by `20100422_rank`.`gi`.`item_id` desc,`20100422_rank`.`gi`.`score` desc,`20100422_rank`.`gi`.`group_id` desc limit 1)))

Instead of checking the existence of the last row in a range, we check it's value. This means that exactly one index record will be evaluated for each item within the group.

Note that the execution plan used by this query looks exactly the same as the first query's one. Same ref condition on item_id, same Using where; Using index. The only difference (not shown in the EXPLAIN output) it that the index is traversed in descending order now. Instead of fetching 1,000 records from the beginning to check the existence of the last one, we just take one record from the end and check its value.

Those who read my blog already know why do we use seemingly redundant item_id DESC in the ORDER BY clause here. To make MySQL to use descending index access path instead of a filesort in an ordered query, we should list all clauses that constitute the index, even if some of them are filtered by the WHERE condition.

The second query completes in only 10 ms which is 40 times as fast as the original query.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

April 22nd, 2010 at 11:00 pm

Posted in MySQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30

EXPLAIN EXTENDED