EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: 5 Claims About SQL, Explained. You're welcome to read and comment on it.

MySQL: Selecting records holding group-wise maximum (resolving ties)

Comments enabled. I *really* need your comment

Continuing the series on selecting records holding group-wise maximums.

The approach shown in the previous article is quite efficient, however, it can only be relied upon if the column being maximized is unique.

If the column is not unique (even along with the grouping column), the ties are possible: multiple records can satisfy the condition of holding a group-wise maximum.

In this article, I'll show how to resolve the ties.

Let's recreate the table we used in the previous article:

Table creation details

This table has 1,000,000 records:

  • id is the PRIMARY KEY
  • orderer is filled with random values from 1 to 10
  • glow is a low cardinality grouping field (10 distinct values)
  • ghigh is a high cardinality grouping field (10,000 distinct values)
  • stuffing is an asterisk-filled VARCHAR(200) column added to emulate payload of the actual tables

Now, let's see how we deal with the different methods to resolve ties:

Returning all ties

Some tasks require all ties to be returned (that is, we need to return every record that holds the group-wise maximum or minimum).

This is simple and requires just a little modification to the query we used before. Instead of joining on the id, we just should join on the MAX or MIN value returned by the grouping query. This guarantees that every tie will be returned.

Here's the query to return all group-wise maximums with ties for a high cardinality grouper:

SELECT  COUNT(*) AS cnt, SUM(id) AS idsum
FROM    (
        SELECT  ghigh, MAX(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.ghigh = dd.ghigh
        AND di.orderer = dd.orderer

View query details

, and the same query for the low cardinality groupers:

SELECT  COUNT(*) AS cnt, SUM(id) AS idsum
FROM    (
        SELECT  glow, MAX(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.glow = dd.glow
        AND di.orderer = dd.orderer

View query details

The final results are aggregated to avoid the post bloating, but the principle is the same.

We see that the queries return the same results but have different execution times.

Due to the way the data are distributed in the table, each grouper (both ghigh and glow) has at least one record with orderer = 1. The group-wise minimum for each row is, therefore, 1 and the queries above, in fact, just return all records with orderer = 1.

MySQL does all joins using the nested loops.

To return the values for high cardinality groupers (those that the table has many) it should do 10,000 loose index scans to find the group-wise minimums, each followed by 10,000 tight index scans returning all records with orderer = 1 (about 11 records per scan in average). This results in 111,431 records being returned.

The query against the low cardinality groupers does the same, but it performs 10 loose index scans to find the minimums and does 10 tight index scans (one per grouper) to find the records that hold it, returning 11,111 records per group.

Tight index scans are sequential and loose index scan require searching the B-Tree. Finding lots of sequential values in few groups is much faster than finding few sequential values in lots of groups, though both algorithms ultimately return the same records.

That's why the query against the high cardinality groupers (lots of groups, few records per group) takes almost 450 ms, while the query against low cardinality groupers (few groups, lots of records per group) takes as few as 107 ms, or 4 times as fast.

Returning arbitrary record

In some occasions, we are not interested in the order of records within the tie: any single record would do as long as it holds the group-wise minimum or maximum.

This way we need to return any arbitrary record holding the group-wise maximum.

To do this, we can use the MySQL's extension which allows to use the non-grouped and non-aggregates columns in a GROUP BY query.

When used this way, the expression will return an arbitrary value taken from any record within the group. This was designed to make the GROUP BY queries applied to joins, when both joining and grouping are done on the PRIMARY KEY of a table participating in the query. The join will duplicate the rows from the table and the GROUP BY will split them across the groups according to the value of the PRIMARY KEY. Since the PRIMARY KEY uniquely defines a row, all columns of the table are guaranteed to have the same values across the group, and there is no need to select any special one: any will do.

This could be applied to out task as well:

SELECT  COUNT(*), SUM(id), MAX(orderer), MIN(orderer)
FROM    (
        SELECT  di.*
        FROM    (
                SELECT  ghigh, MIN(orderer) AS orderer
                FROM    t_distinct d
                GROUP BY
                        ghigh
                ) dd
        JOIN    t_distinct di
        ON      di.ghigh = dd.ghigh
                AND di.orderer = dd.orderer
        GROUP BY
                di.ghigh, di.orderer
        ) q

View query details

This solution is quite efficient for the high-cardinality groupers, but has a serious drawback.

The expressions returned by the GROUP BY extension are not guaranteed to originate from one record.

Imagine this query:

SELECT  grouper, column1, column2
FROM    mytable
GROUP BY
         grouper

against this table:

grouper column1 column2
1 10 100
1 20 200

This query is free to return any value of column1 and column2 within the group.

That means that this resultset:

grouper column1 column2
1 10 200

would be a correct output for this query, with the values of column1 and column2 being taken from two different records (within a single group).

Though in reality MySQL returns the values from a single record, it's a bad practice to rely on this behavior which can be changed with any new version or even a bugfix patch.

However, we can use a cleaner version of the query which would do just the same.

Instead of grouping on the non-primary key values, we can just select the primary key of an arbitrary record and join the table back on it.

To select a primary key of an arbitrary record we can just use a subquery with a LIMIT 1 applied to it.

Here's the query for the high cardinality grouper:

SELECT  COUNT(*), SUM(id), MAX(orderer), MIN(orderer)
FROM    (
        SELECT  di.*
        FROM    (
                SELECT  ghigh, MIN(orderer) AS orderer
                FROM    t_distinct d
                GROUP BY
                        ghigh
                ) dd
        JOIN    t_distinct di
        ON      di.id = 
                (
                SELECT  id
                FROM    t_distinct ds
                WHERE   ds.ghigh = dd.ghigh
                        AND ds.orderer = dd.orderer
                LIMIT 1
                )        
        ) q

View query details

and the same query for the low cardinality grouper:

SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id = 
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        LIMIT 1
        )
id orderer ghigh glow
11 1 11 1
382 1 382 2
3 1 3 3
44 1 44 4
55 1 55 5
26 1 26 6
7 1 7 7
118 1 118 8
19 1 19 9
150 1 150 10
10 rows fetched in 0.0004s (0.0032s)

View query details

The low cardinality query returns much less columns and is almost instant (3 ms).

The high cardinality columns requires more index lookups that the previous GROUP BY version (to find the group-wise minimums; to find the id of an arbitrary record; to join the table back on this id). However, it does not use filesort.

This makes it just a little bit less efficient than a GROUP BY query (570 ms vs 560 ms), but it is guaranteed to return correct results, and, hence, should be used instead of the GROUP BY version.

Returning a record with maximum id

Finally, let's make a query that would resolve the ties by returning the column with the maximum or the minimum id (or another unique column).

In this case we cannot get away with only using a MIN or a MAX, since we do need to sort on a composite condition, and MAX and MIN only accept and return a single value.

To do this, we need to use a subquery returning an id, just like we did in the previous case. The only thing we need to change is to add a correct ORDER BY condition to this subquery.

To make it more interesting, let's create a query that would return the records holding the group-wise minimums of orderer, but would resolve ties by selecting a maximal id.

SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id = 
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        ORDER BY
                id DESC
        LIMIT 1
        )

View query details

This works, but is quite inefficient (380 ms).

This is due to the flaw in MySQL's optimizer.

As you can see, the subquery has a filtering equality condition on two columns (glow and orderer) and an ordering condition on the third column, id.

There is an index that covers all three columns: ix_distinct_glow_orderer (glow, orderer). The first two columns are explicitly provided in the index definition, and the third column, id, being a PRIMARY KEY, is implicitly included by the InnoDB engine as a row pointer.

However, MySQL is able to use this index for ordering in a plain query but not in a subquery. It does use the index for filtering but orders using a filesort.

To work around this, we need to modify the ordering condition so that it would include all three columns constituting the index, all in the same direction:

SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id = 
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        ORDER BY
                glow DESC, orderer DESC, id DESC
        LIMIT 1
        )

View query details

Now there is no filesort in the DEPENDENT SUBQUERY, and the whole query completes in only 3 ms (100 times as fast).

The use of orderer DESC in the ordering condition of the subquery despite the fact that we need to return the group-wise minimum, not maximum may seem counterintuitive. However, this is only a workaround to make the ordering condition eligible for the index. The actual value of the orderer used the subquery is defined by the value of MAX or MIN returned by the outer query and the records are filtered on it anyway. Logically, orderer ASC, orderer DESC and even the absence of the orderer in the ORDER BY are the same. It's just required to make MySQL use the index, and to do this we need to sort on all columns constituting the index, even though the first two ordering conditions are made constant by the filter and could be omitted as well.

To be continued.

Written by Quassnoi

November 25th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply