MySQL: Selecting records holding group-wise maximum (on a unique column)
Answering questions asked on the site.
Juan asks:
Regarding this question:
I would like to know what database is the most efficient when performing this type of query and if the solutions proposed are already the most efficient.
I am using MySQL but I would like to know if PostgreSQL, Oracle, Microsoft SQL Server or DB2 are much more efficient.
A nice question, this is a common problem for most developers.
For those too lazy to follow the link, the problem in one sentence:
How do I select the whole records, grouped on
grouper
and holding a group-wise maximum (or minimum) on other column?
I've already covered this in passing in various previous articles in my blog, and now I'll try to make a little summary.
According to the SQL standards, every expression used in the SELECT
list of a query that deals with GROUP BY
should be either grouped by (that is, used in the GROUP BY
) or aggregated.
The reason behind this is quite logical: a grouping query takes the records that have something in common and shrinks them, making a single record out of the whole set constituting the group.
The grouping expressions, of course, share their values across all the records constituting the group (since they are what that makes the group a group), and hence can be used as is.
Other expressions can vary and therefore some algorithm should be applied for shrinking all existing values into a single value. This is what aggregating function does.
The problem is that the the aggregating functions are not interdependent: you cannot use the result of one function as an input to another one. Hence, you cannot select a record in an aggregate query: you can only select a tuple of aggregated values. It's possible that each of these aggregated values in fact can never even found among the values of the original records. Say, with all values of column
being strictly positive, you can never find a record holding a SUM(column)
: it will be greater than any of the values of column
.
However, with MIN
or MAX
, it is guaranteed that at least one record holds the non-NULL
value returned by these aggregate functions. The problem is to return the whole record without aggregating the other columns.
As SQL evolved, different solutions had been proposed to solve this problem, and I'll try to cover these solutions in this series of articles.
First, I'll cover MySQL (as per request)
MySQL's dialect of SQL is much more poor than those used by the other systems. That's why most of the solutions used by the other systems, like analytic functions, DISTINCT ON
etc. cannot be used in MySQL.
However, MySQL still provides some ways to make the queries that solve this task more efficient.
Let's create a sample table:
Table creation details
This table has 1,000,000 records:
id
is the PRIMARY KEYorderer
is filled with random values from 1 to 10glow
is a low cardinality grouping field (10 distinct values)ghigh
is a high cardinality grouping field (10,000 distinct values)
There are composite indexes on grouping columns: those including the id
as a secondary field and those including the orderer
.
We will split the task into four large categories which depend on properties the column we maximize (or minimize) on and how should we process ties.
The column being maximized is unique
This means that the ties are impossible: the value of the column is guaranteed to identify the row (even outside the group).
All we have to do is to select the greatest value of the column and join the table back on this value.
Here's how we do it in MySQL:
SELECT di.id, di.orderer, di.ghigh, di.glow FROM ( SELECT glow, MAX(id) AS id FROM t_distinct d GROUP BY glow ) dd JOIN t_distinct di ON di.id = dd.id
In the query plan we see Using index for group-by
.
This access method is used by MySQL to serve DISTINCT
and GROUP BY
queries, or, being more exact, to build the list of distinct groupers using an index.
The idea behind the method is very simple:
- Find the first key for the index and return it
- Iteratively find the next key (greater than the one previously found) traversing the index tree and return it
Whenever the next index key is found, the aggregates are applied to that key.
The indexes in MySQL are built using a B-Tree. This means that the iterative searching will have to restart from the root key on each key. The B-Tree
is O(log(n))
levels deep in average, this means that the index scan for group by
is O(log(n))
times as long as the index scan on these value would take be the values unique.
Since the aggregates normally require all values from the group, there is no point in using this access method to calculate the aggregates: a sequential index scan would be way more efficient (without O(log(n))
in the formula).
MIN
and MAX
, however, are somewhat special aggregates, since aggregating a column with them means just taking a first (or a last) value from the appropriate index. MySQL is aware of this and uses the index for group-by
when calculating MIN
and MAX
on secondary index columns. This is called loose index scan.
The index for group-by
, when used over a composite index, naturally stops on the lowest group-wise value of the second column. The MIN
, therefore, is free: the value of the second column (which constitutes the MIN
) should be just returned along with the value of the grouping column.
A MAX
, on the other hand, requires an additional index search to find the last value of the group.
Overall query time, thus, depends on the grouping column cardinality and on the aggregate function (MIN
or MAX
being used). The lower the cardinality is, the better; and MIN
is faster than MAX
.
We see that on a low cardinality column, glow
, which has only 10 distinct values, the query takes only 2 ms, which is next instant. We can hardly notice the difference between the MIN
and MAX
on such a fast query.
Let's see the difference between two queries being grouped on a high cardinality column. Since the query would return 10,000 records, I'll make it aggregate the final results as well to avoid bloating the post.
Here's the MIN
query:
SELECT SUM(di.id) FROM ( SELECT ghigh, MIN(id) AS id FROM t_distinct d GROUP BY ghigh ) dd JOIN t_distinct di ON di.id = dd.id
SUM(di.id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50005000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.1638s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 10000 | 100.00 | |||||
1 | PRIMARY | di | eq_ref | PRIMARY | PRIMARY | 4 | dd.id | 1 | 100.00 | Using index |
2 | DERIVED | d | range | ix_distinct_ghigh_id | 4 | 20835 | 100.00 | Using index for group-by |
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,min(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)
, and here's the MAX
one:
SELECT SUM(di.id) FROM ( SELECT ghigh, MAX(id) AS id FROM t_distinct d GROUP BY ghigh ) dd JOIN t_distinct di ON di.id = dd.id
SUM(di.id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9950005000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.2800s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 10000 | 100.00 | |||||
1 | PRIMARY | di | eq_ref | PRIMARY | PRIMARY | 4 | dd.id | 1 | 100.00 | Using index |
2 | DERIVED | d | range | ix_distinct_ghigh_id | 4 | 20835 | 100.00 | Using index for group-by |
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,max(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)
With the grouping column having high cardinality, this query needs much more next key lookups
to complete and takes significantly more time. However, on a 1,000,000 records table, it's still a matter of fractions of a second. The MAX
query takes 280 ms, which is almost twice the time of the MIN
query (160 ms).
As was just described, the reason for that is that the MAX
query need an extra lookup, since the records being iterated by index for group-by
hold the MIN
, not the MAX
.
Unfortunately, MySQL cannot iterate the index backwards in a loose index scan, nor it supports the DESC
clause in the indexes. That means that taking the group-wise maximum is less efficient than taking a group-wise minimum (with a loose index scan being used).
However, with the grouping columns having sufficiently low cardinality, both queries are very fast and the difference negligible.
To be continued
Ahoy!
It’s great to see how detailed you explain your ideas and thoughts here. But maybe – just as an maybe useful suggestion! – the following links may inspire you?
http://dev.mysql.com/doc/refman/5.0/en/example-maximum-column-group-row.html
http://jan.kneschke.de/2007/2/15/groupwise-max/
afaik both also work with columns holding unique values.
Kind regards,
Me (RĂ¼diger from Germany)
IBeRKa
7 Feb 15 at 10:30
Hey,
Those are nice articles, thanks.
Quassnoi
7 Feb 15 at 14:02