MySQL: Selecting records holding group-wise maximum (resolving ties)
Continuing the series on selecting records holding group-wise maximums.
The approach shown in the previous article is quite efficient, however, it can only be relied upon if the column being maximized is unique.
If the column is not unique (even along with the grouping column), the ties are possible: multiple records can satisfy the condition of holding a group-wise maximum
.
In this article, I'll show how to resolve the ties.
Let's recreate the table we used in the previous article:
Table creation details
This table has 1,000,000 records:
id
is thePRIMARY KEY
orderer
is filled with random values from 1 to 10glow
is a low cardinality grouping field (10 distinct values)ghigh
is a high cardinality grouping field (10,000 distinct values)stuffing
is an asterisk-filledVARCHAR(200)
column added to emulate payload of the actual tables
Now, let's see how we deal with the different methods to resolve ties:
Returning all ties
Some tasks require all ties to be returned (that is, we need to return every record that holds the group-wise maximum or minimum).
This is simple and requires just a little modification to the query we used before. Instead of joining on the id
, we just should join on the MAX
or MIN
value returned by the grouping query. This guarantees that every tie will be returned.
Here's the query to return all group-wise maximums with ties for a high cardinality grouper:
SELECT COUNT(*) AS cnt, SUM(id) AS idsum FROM ( SELECT ghigh, MAX(orderer) AS orderer FROM t_distinct d GROUP BY ghigh ) dd JOIN t_distinct di ON di.ghigh = dd.ghigh AND di.orderer = dd.orderer
, and the same query for the low cardinality groupers:
SELECT COUNT(*) AS cnt, SUM(id) AS idsum FROM ( SELECT glow, MAX(orderer) AS orderer FROM t_distinct d GROUP BY glow ) dd JOIN t_distinct di ON di.glow = dd.glow AND di.orderer = dd.orderer
The final results are aggregated to avoid the post bloating, but the principle is the same.
We see that the queries return the same results but have different execution times.
Due to the way the data are distributed in the table, each grouper (both ghigh
and glow
) has at least one record with orderer = 1
. The group-wise minimum for each row is, therefore, 1 and the queries above, in fact, just return all records with orderer = 1
.
MySQL does all joins using the nested loops.
To return the values for high cardinality groupers (those that the table has many) it should do 10,000 loose index scans to find the group-wise minimums, each followed by 10,000 tight index scans returning all records with orderer = 1
(about 11 records per scan in average). This results in 111,431 records being returned.
The query against the low cardinality groupers does the same, but it performs 10 loose index scans to find the minimums and does 10 tight index scans (one per grouper) to find the records that hold it, returning 11,111 records per group.
Tight index scans are sequential and loose index scan require searching the B-Tree. Finding lots of sequential values in few groups is much faster than finding few sequential values in lots of groups, though both algorithms ultimately return the same records.
That's why the query against the high cardinality groupers (lots of groups, few records per group) takes almost 450 ms, while the query against low cardinality groupers (few groups, lots of records per group) takes as few as 107 ms, or 4 times as fast.
Returning arbitrary record
In some occasions, we are not interested in the order of records within the tie: any single record would do as long as it holds the group-wise minimum or maximum.
This way we need to return any arbitrary record holding the group-wise maximum.
To do this, we can use the MySQL's extension which allows to use the non-grouped and non-aggregates columns in a GROUP BY
query.
When used this way, the expression will return an arbitrary value taken from any record within the group. This was designed to make the GROUP BY
queries applied to joins, when both joining and grouping are done on the PRIMARY KEY
of a table participating in the query. The join will duplicate the rows from the table and the GROUP BY
will split them across the groups according to the value of the PRIMARY KEY
. Since the PRIMARY KEY
uniquely defines a row, all columns of the table are guaranteed to have the same values across the group, and there is no need to select any special one: any will do.
This could be applied to out task as well:
SELECT COUNT(*), SUM(id), MAX(orderer), MIN(orderer) FROM ( SELECT di.* FROM ( SELECT ghigh, MIN(orderer) AS orderer FROM t_distinct d GROUP BY ghigh ) dd JOIN t_distinct di ON di.ghigh = dd.ghigh AND di.orderer = dd.orderer GROUP BY di.ghigh, di.orderer ) q
This solution is quite efficient for the high-cardinality groupers, but has a serious drawback.
The expressions returned by the GROUP BY
extension are not guaranteed to originate from one record.
Imagine this query:
SELECT grouper, column1, column2 FROM mytable GROUP BY grouper
against this table:
grouper | column1 | column2 |
---|---|---|
1 | 10 | 100 |
1 | 20 | 200 |
This query is free to return any value of column1
and column2
within the group.
That means that this resultset:
grouper | column1 | column2 |
---|---|---|
1 | 10 | 200 |
would be a correct output for this query, with the values of column1
and column2
being taken from two different records (within a single group).
Though in reality MySQL returns the values from a single record, it's a bad practice to rely on this behavior which can be changed with any new version or even a bugfix patch.
However, we can use a cleaner version of the query which would do just the same.
Instead of grouping on the non-primary key values, we can just select the primary key of an arbitrary record and join the table back on it.
To select a primary key of an arbitrary record we can just use a subquery with a LIMIT 1
applied to it.
Here's the query for the high cardinality grouper:
SELECT COUNT(*), SUM(id), MAX(orderer), MIN(orderer) FROM ( SELECT di.* FROM ( SELECT ghigh, MIN(orderer) AS orderer FROM t_distinct d GROUP BY ghigh ) dd JOIN t_distinct di ON di.id = ( SELECT id FROM t_distinct ds WHERE ds.ghigh = dd.ghigh AND ds.orderer = dd.orderer LIMIT 1 ) ) q
and the same query for the low cardinality grouper:
SELECT di.id, di.orderer, di.ghigh, di.glow FROM ( SELECT glow, MIN(orderer) AS orderer FROM t_distinct d GROUP BY glow ) dd JOIN t_distinct di ON di.id = ( SELECT id FROM t_distinct ds WHERE ds.glow = dd.glow AND ds.orderer = dd.orderer LIMIT 1 )
id | orderer | ghigh | glow | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 1 | 11 | 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
382 | 1 | 382 | 2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | 1 | 3 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
44 | 1 | 44 | 4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
55 | 1 | 55 | 5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
26 | 1 | 26 | 6 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | 1 | 7 | 7 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
118 | 1 | 118 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
19 | 1 | 19 | 9 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
150 | 1 | 150 | 10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 rows fetched in 0.0004s (0.0032s) |
The low cardinality query returns much less columns and is almost instant (3 ms).
The high cardinality columns requires more index lookups that the previous GROUP BY
version (to find the group-wise minimums; to find the id
of an arbitrary record; to join the table back on this id
). However, it does not use filesort
.
This makes it just a little bit less efficient than a GROUP BY
query (570 ms vs 560 ms), but it is guaranteed to return correct results, and, hence, should be used instead of the GROUP BY
version.
Returning a record with maximum id
Finally, let's make a query that would resolve the ties by returning the column with the maximum or the minimum id
(or another unique column).
In this case we cannot get away with only using a MIN
or a MAX
, since we do need to sort on a composite condition, and MAX
and MIN
only accept and return a single value.
To do this, we need to use a subquery returning an id
, just like we did in the previous case. The only thing we need to change is to add a correct ORDER BY
condition to this subquery.
To make it more interesting, let's create a query that would return the records holding the group-wise minimums of orderer
, but would resolve ties by selecting a maximal id
.
SELECT di.id, di.orderer, di.ghigh, di.glow FROM ( SELECT glow, MIN(orderer) AS orderer FROM t_distinct d GROUP BY glow ) dd JOIN t_distinct di ON di.id = ( SELECT id FROM t_distinct ds WHERE ds.glow = dd.glow AND ds.orderer = dd.orderer ORDER BY id DESC LIMIT 1 )
This works, but is quite inefficient (380 ms).
This is due to the flaw in MySQL's optimizer.
As you can see, the subquery has a filtering equality condition on two columns (glow
and orderer
) and an ordering condition on the third column, id
.
There is an index that covers all three columns: ix_distinct_glow_orderer (glow, orderer)
. The first two columns are explicitly provided in the index definition, and the third column, id
, being a PRIMARY KEY
, is implicitly included by the InnoDB engine as a row pointer.
However, MySQL is able to use this index for ordering in a plain query but not in a subquery. It does use the index for filtering but orders using a filesort
.
To work around this, we need to modify the ordering condition so that it would include all three columns constituting the index, all in the same direction:
SELECT di.id, di.orderer, di.ghigh, di.glow FROM ( SELECT glow, MIN(orderer) AS orderer FROM t_distinct d GROUP BY glow ) dd JOIN t_distinct di ON di.id = ( SELECT id FROM t_distinct ds WHERE ds.glow = dd.glow AND ds.orderer = dd.orderer ORDER BY glow DESC, orderer DESC, id DESC LIMIT 1 )
Now there is no filesort
in the DEPENDENT SUBQUERY
, and the whole query completes in only 3 ms (100 times as fast).
The use of orderer DESC
in the ordering condition of the subquery despite the fact that we need to return the group-wise minimum, not maximum may seem counterintuitive. However, this is only a workaround to make the ordering condition eligible for the index. The actual value of the orderer
used the subquery is defined by the value of MAX
or MIN
returned by the outer query and the records are filtered on it anyway. Logically, orderer ASC
, orderer DESC
and even the absence of the orderer
in the ORDER BY
are the same. It's just required to make MySQL use the index, and to do this we need to sort on all columns constituting the index, even though the first two ordering conditions are made constant by the filter and could be omitted as well.
To be continued.
Thank you for this cogent post which is exactly what I needed after spending a couple of days getting a solution for my own group-wise maximum problem working, but having no idea how to handle ties — which I really need to do.
Joining a “limit 1” relation to the grouped result is just the clever concept I was looking for. I look forward to trying it tomorrow morning first thing and seeing how it goes!
Ward Willats
30 May 14 at 10:47