EXPLAIN EXTENDED

How to create fast database queries


MySQL: Selecting records holding group-wise maximum (on a unique column)

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Juan asks:

Regarding this question:

I would like to know what database is the most efficient when performing this type of query and if the solutions proposed are already the most efficient.

I am using MySQL but I would like to know if PostgreSQL, Oracle, Microsoft SQL Server or DB2 are much more efficient.

A nice question: this is a common problem for most developers.

For those too lazy to follow the link, the problem in one sentence:

How do I select whole records, grouped on a grouping column, so that each record holds the group-wise maximum (or minimum) of another column?

I've already covered this in passing in various previous articles in my blog, and now I'll try to make a little summary.

According to the SQL standards, every expression used in the SELECT list of a query that deals with GROUP BY should be either grouped by (that is, used in the GROUP BY) or aggregated.

The reason behind this is quite logical: a grouping query takes the records that have something in common and shrinks them, making a single record out of the whole set constituting the group.

The grouping expressions, of course, share their values across all the records constituting the group (since they are what makes the group a group), and hence can be used as is.

Other expressions can vary, so some algorithm should be applied to shrink all existing values into a single value. This is what an aggregate function does.

The problem is that the aggregate functions are independent of each other: you cannot use the result of one function as an input to another one. Hence, you cannot select a record in an aggregate query: you can only select a tuple of aggregated values. It is possible that some of these aggregated values are never even found among the values of the original records. Say, with all values of column being strictly positive, you can never find a record holding SUM(column): it will be greater than any single value of column.
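To illustrate with a hypothetical table t(grouper, other) (the names are mine):

```sql
-- Invalid per the SQL standard: other is neither grouped by nor aggregated
SELECT  grouper, other
FROM    t
GROUP BY
        grouper;

-- Valid: everything in the SELECT list is either grouped by or aggregated
SELECT  grouper, SUM(other)
FROM    t
GROUP BY
        grouper;
```

Note that MySQL, unless ONLY_FULL_GROUP_BY is set, historically accepted the first query and returned an indeterminate value of other, which is exactly why this class of queries is so often written incorrectly.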

However, with MIN or MAX, it is guaranteed that at least one record holds the non-NULL value returned by these aggregate functions. The problem is to return the whole record without aggregating the other columns.

As SQL evolved, different solutions have been proposed for this problem, and I'll try to cover them in this series of articles.

First, I'll cover MySQL (as per the request).

MySQL's dialect of SQL is much poorer than those used by the other systems. That's why most of the solutions used by the other systems, such as analytic functions or DISTINCT ON, cannot be used in MySQL.

However, MySQL still provides some ways to make the queries that solve this task more efficient.

Let's create a sample table:


This table has 1,000,000 records:

  • id is the PRIMARY KEY
  • orderer is filled with random values from 1 to 10
  • glow is a low cardinality grouping field (10 distinct values)
  • ghigh is a high cardinality grouping field (10,000 distinct values)

There are composite indexes on the grouping columns: ones including id as a secondary field and ones including orderer.
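The original DDL is not reproduced here; a sketch consistent with the description above (ix_distinct_ghigh_id appears in the plans below; the other index names and exact column types are my assumptions) could look like this:

```sql
CREATE TABLE t_distinct (
        id INT NOT NULL PRIMARY KEY,
        orderer INT NOT NULL,  -- random values from 1 to 10
        glow INT NOT NULL,     -- low cardinality grouper: 10 distinct values
        ghigh INT NOT NULL,    -- high cardinality grouper: 10,000 distinct values
        KEY ix_distinct_glow_id (glow, id),
        KEY ix_distinct_ghigh_id (ghigh, id),
        KEY ix_distinct_glow_orderer (glow, orderer),
        KEY ix_distinct_ghigh_orderer (ghigh, orderer)
);
```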

We will split the task into four large categories, depending on the properties of the column we maximize (or minimize) on and on how we should process the ties.

The column being maximized is unique

This means that the ties are impossible: the value of the column is guaranteed to identify the row (even outside the group).

All we have to do is to select the greatest value of the column and join the table back on this value.

Here's how we do it in MySQL:

SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MAX(id) AS id
        FROM    t_distinct d
        GROUP BY
                glow        
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id


In the query plan we see Using index for group-by.

This access method is used by MySQL to serve DISTINCT and GROUP BY queries, or, more exactly, to build the list of distinct groupers using an index.

The idea behind the method is very simple:

  1. Find the first key for the index and return it
  2. Iteratively find the next key (greater than the one previously found) traversing the index tree and return it

Whenever the next index key is found, the aggregates are applied to that key.
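The two steps above can be sketched as plain SQL seeks over the (glow, id) index (an emulation written by me to show the idea, not how MySQL literally executes the scan):

```sql
-- Step 1: find the first index key, i. e. the first group
-- along with its minimal id
SELECT  glow, id
FROM    t_distinct
ORDER BY
        glow, id
LIMIT 1;

-- Step 2: find the next key, greater than the group found on
-- the previous step, descending the B-Tree from the root again
SELECT  glow, id
FROM    t_distinct
WHERE   glow > 1  -- the group returned by the previous seek
ORDER BY
        glow, id
LIMIT 1;
```

Each such seek is a separate descent of the index tree, which is exactly where the extra cost described below comes from.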

The indexes in MySQL are built as B-Trees. This means that the iterative searching has to restart from the root on each key. A B-Tree is O(log(n)) levels deep on average, so the index scan for group-by takes O(log(n)) times as long as a plain index scan over these values would take, were the values unique.

Since the aggregates normally require all values from the group, there is no point in using this access method to calculate them: a sequential index scan would be far more efficient (without the O(log(n)) factor in the formula).

MIN and MAX, however, are somewhat special aggregates: aggregating a column with them means just taking the first (or the last) value from the appropriate index. MySQL is aware of this and uses the index for group-by when calculating MIN and MAX on secondary index columns. This is called a loose index scan.

The index for group-by, when used over a composite index, naturally stops on the lowest value of the second column within each group. The MIN, therefore, comes for free: the value of the second column (which constitutes the MIN) is just returned along with the value of the grouping column.

A MAX, on the other hand, requires an additional index search to find the last value of the group.
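That extra search can again be sketched as a plain SQL seek (my emulation of the idea, not MySQL's literal execution):

```sql
-- For each group found by the loose index scan, MAX(id) requires
-- one more descent of the index to the last key of the group
SELECT  id
FROM    t_distinct
WHERE   ghigh = 1  -- the current group
ORDER BY
        id DESC
LIMIT 1;
```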

Overall query time, thus, depends on the cardinality of the grouping column and on the aggregate function used (MIN or MAX). The lower the cardinality, the better; and MIN is faster than MAX.

We see that on a low cardinality column, glow, which has only 10 distinct values, the query takes only 2 ms, which is next to instant. We can hardly notice the difference between MIN and MAX on such a fast query.

Let's see the difference between the two queries when grouping on a high cardinality column. Since each query would return 10,000 records, I'll make it aggregate the final results as well, to avoid bloating the post.

Here's the MIN query:

SELECT  SUM(di.id)
FROM    (
        SELECT  ghigh, MIN(id) AS id
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
SUM(di.id)
50005000

1 row fetched in 0.0001s (0.1638s)

id  select_type  table       type    possible_keys  key                   key_len  ref    rows   filtered  Extra
1   PRIMARY      <derived2>  ALL                                                   10000  100.00
1   PRIMARY      di          eq_ref  PRIMARY        PRIMARY               4        dd.id  1      100.00    Using index
2   DERIVED      d           range                  ix_distinct_ghigh_id  4               20835  100.00    Using index for group-by
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,min(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)

And here's the MAX one:

SELECT  SUM(di.id)
FROM    (
        SELECT  ghigh, MAX(id) AS id
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
SUM(di.id)
9950005000

1 row fetched in 0.0001s (0.2800s)

id  select_type  table       type    possible_keys  key                   key_len  ref    rows   filtered  Extra
1   PRIMARY      <derived2>  ALL                                                   10000  100.00
1   PRIMARY      di          eq_ref  PRIMARY        PRIMARY               4        dd.id  1      100.00    Using index
2   DERIVED      d           range                  ix_distinct_ghigh_id  4               20835  100.00    Using index for group-by
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,max(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)

With the grouping column having high cardinality, this query needs many more next-key lookups to complete and takes significantly more time. However, on a table of 1,000,000 records, it's still a matter of fractions of a second: the MAX query takes 280 ms, which is almost twice the time of the MIN query (160 ms).

As was just described, the reason is that the MAX query needs an extra lookup, since the records iterated over by the index for group-by hold the MIN, not the MAX.

Unfortunately, MySQL can neither iterate the index backwards in a loose index scan, nor does it support the DESC clause in indexes. That means that taking a group-wise maximum is less efficient than taking a group-wise minimum (when a loose index scan is used).

However, with the grouping columns having sufficiently low cardinality, both queries are very fast and the difference is negligible.

To be continued

Written by Quassnoi

November 24th, 2009 at 11:00 pm

Posted in MySQL
