EXPLAIN EXTENDED

How to create fast database queries


MySQL: Selecting records holding group-wise maximum (on a unique column)

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Juan asks:

Regarding this question:

I would like to know what database is the most efficient when performing this type of query and if the solutions proposed are already the most efficient.

I am using MySQL but I would like to know if PostgreSQL, Oracle, Microsoft SQL Server or DB2 are much more efficient.

A nice question: this is a common problem for most developers.

For those too lazy to follow the link, the problem in one sentence:

How do I select whole records, grouped on a grouping column, so that each record holds the group-wise maximum (or minimum) of another column?

I've already covered this in passing in various previous articles in my blog, and now I'll try to make a little summary.

According to the SQL standards, every expression used in the SELECT list of a query that deals with GROUP BY should be either grouped by (that is, used in the GROUP BY) or aggregated.

The reason behind this is quite logical: a grouping query takes the records that have something in common and shrinks them, making a single record out of the whole set constituting the group.

The grouping expressions, of course, share their values across all the records constituting the group (since they are what makes the group a group), and hence can be used as is.

Other expressions can vary, so some algorithm should be applied to shrink all existing values into a single value. This is what an aggregate function does.

The problem is that the aggregate functions are independent of each other: you cannot use the result of one function as an input to another one. Hence, you cannot select a record in an aggregate query: you can only select a tuple of aggregated values. It is possible that some of these aggregated values are never even found among the values of the original records. Say, with all values of column being strictly positive, you can never find a record holding SUM(column): it will be greater than any single value of column.
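To illustrate with a hypothetical table t(grouper, other) (the names are mine):

```sql
-- Invalid per the SQL standard: other is neither grouped by nor aggregated
SELECT  grouper, other
FROM    t
GROUP BY
        grouper;

-- Valid: everything in the SELECT list is either grouped by or aggregated
SELECT  grouper, SUM(other)
FROM    t
GROUP BY
        grouper;
```

Note that MySQL, unless ONLY_FULL_GROUP_BY is set, historically accepted the first query and returned an indeterminate value of other, which is exactly why this class of queries is so often written incorrectly.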

However, with MIN or MAX, it is guaranteed that at least one record holds the non-NULL value returned by these aggregate functions. The problem is to return the whole record without aggregating the other columns.

As SQL evolved, different solutions have been proposed for this problem, and I'll try to cover them in this series of articles.

First, I'll cover MySQL (as per the request).

MySQL's dialect of SQL is much poorer than those used by the other systems. That's why most of the solutions used by the other systems, such as analytic functions or DISTINCT ON, cannot be used in MySQL.

However, MySQL still provides some ways to make the queries that solve this task more efficient.

Let's create a sample table:


This table has 1,000,000 records:

  • id is the PRIMARY KEY
  • orderer is filled with random values from 1 to 10
  • glow is a low cardinality grouping field (10 distinct values)
  • ghigh is a high cardinality grouping field (10,000 distinct values)

There are composite indexes on the grouping columns: ones including id as a secondary field and ones including orderer.
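The original DDL is not reproduced here; a sketch consistent with the description above (ix_distinct_ghigh_id appears in the plans below; the other index names and exact column types are my assumptions) could look like this:

```sql
CREATE TABLE t_distinct (
        id INT NOT NULL PRIMARY KEY,
        orderer INT NOT NULL,  -- random values from 1 to 10
        glow INT NOT NULL,     -- low cardinality grouper: 10 distinct values
        ghigh INT NOT NULL,    -- high cardinality grouper: 10,000 distinct values
        KEY ix_distinct_glow_id (glow, id),
        KEY ix_distinct_ghigh_id (ghigh, id),
        KEY ix_distinct_glow_orderer (glow, orderer),
        KEY ix_distinct_ghigh_orderer (ghigh, orderer)
);
```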

We will split the task into four large categories, depending on the properties of the column we maximize (or minimize) on and on how we should process the ties.

The column being maximized is unique

This means that the ties are impossible: the value of the column is guaranteed to identify the row (even outside the group).

All we have to do is to select the greatest value of the column and join the table back on this value.

Here's how we do it in MySQL:

SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MAX(id) AS id
        FROM    t_distinct d
        GROUP BY
                glow        
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id


In the query plan we see Using index for group-by.

This access method is used by MySQL to serve DISTINCT and GROUP BY queries, or, more exactly, to build the list of distinct groupers using an index.

The idea behind the method is very simple:

  1. Find the first key for the index and return it
  2. Iteratively find the next key (greater than the one previously found) traversing the index tree and return it

Whenever the next index key is found, the aggregates are applied to that key.
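The two steps above can be sketched as plain SQL seeks over the (glow, id) index (an emulation written by me to show the idea, not how MySQL literally executes the scan):

```sql
-- Step 1: find the first index key, i. e. the first group
-- along with its minimal id
SELECT  glow, id
FROM    t_distinct
ORDER BY
        glow, id
LIMIT 1;

-- Step 2: find the next key, greater than the group found on
-- the previous step, descending the B-Tree from the root again
SELECT  glow, id
FROM    t_distinct
WHERE   glow > 1  -- the group returned by the previous seek
ORDER BY
        glow, id
LIMIT 1;
```

Each such seek is a separate descent of the index tree, which is exactly where the extra cost described below comes from.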

The indexes in MySQL are built as B-Trees. This means that the iterative searching has to restart from the root on each key. A B-Tree is O(log(n)) levels deep on average, so the index scan for group-by takes O(log(n)) times as long as a plain index scan over these values would take, were the values unique.

Since the aggregates normally require all values from the group, there is no point in using this access method to calculate them: a sequential index scan would be far more efficient (without the O(log(n)) factor in the formula).

MIN and MAX, however, are somewhat special aggregates: aggregating a column with them means just taking the first (or the last) value from the appropriate index. MySQL is aware of this and uses the index for group-by when calculating MIN and MAX on secondary index columns. This is called a loose index scan.

The index for group-by, when used over a composite index, naturally stops on the lowest value of the second column within each group. The MIN, therefore, comes for free: the value of the second column (which constitutes the MIN) is just returned along with the value of the grouping column.

A MAX, on the other hand, requires an additional index search to find the last value of the group.
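That extra search can again be sketched as a plain SQL seek (my emulation of the idea, not MySQL's literal execution):

```sql
-- For each group found by the loose index scan, MAX(id) requires
-- one more descent of the index to the last key of the group
SELECT  id
FROM    t_distinct
WHERE   ghigh = 1  -- the current group
ORDER BY
        id DESC
LIMIT 1;
```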

Overall query time, thus, depends on the cardinality of the grouping column and on the aggregate function used (MIN or MAX). The lower the cardinality, the better; and MIN is faster than MAX.

We see that on a low cardinality column, glow, which has only 10 distinct values, the query takes only 2 ms, which is next to instant. We can hardly notice the difference between MIN and MAX on such a fast query.

Let's see the difference between the two queries when grouping on a high cardinality column. Since each query would return 10,000 records, I'll make it aggregate the final results as well, to avoid bloating the post.

Here's the MIN query:

SELECT  SUM(di.id)
FROM    (
        SELECT  ghigh, MIN(id) AS id
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
SUM(di.id)
50005000

1 row fetched in 0.0001s (0.1638s)

id  select_type  table       type    possible_keys  key                   key_len  ref    rows   filtered  Extra
1   PRIMARY      <derived2>  ALL                                                   10000  100.00
1   PRIMARY      di          eq_ref  PRIMARY        PRIMARY               4        dd.id  1      100.00    Using index
2   DERIVED      d           range                  ix_distinct_ghigh_id  4               20835  100.00    Using index for group-by
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,min(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)

And here's the MAX one:

SELECT  SUM(di.id)
FROM    (
        SELECT  ghigh, MAX(id) AS id
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
SUM(di.id)
9950005000

1 row fetched in 0.0001s (0.2800s)

id  select_type  table       type    possible_keys  key                   key_len  ref    rows   filtered  Extra
1   PRIMARY      <derived2>  ALL                                                   10000  100.00
1   PRIMARY      di          eq_ref  PRIMARY        PRIMARY               4        dd.id  1      100.00    Using index
2   DERIVED      d           range                  ix_distinct_ghigh_id  4               20835  100.00    Using index for group-by
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,max(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)

With the grouping column having high cardinality, this query needs many more next-key lookups to complete and takes significantly more time. However, on a table of 1,000,000 records, it's still a matter of fractions of a second: the MAX query takes 280 ms, which is almost twice the time of the MIN query (160 ms).

As was just described, the reason is that the MAX query needs an extra lookup, since the records iterated over by the index for group-by hold the MIN, not the MAX.

Unfortunately, MySQL can neither iterate the index backwards in a loose index scan, nor does it support the DESC clause in indexes. That means that taking a group-wise maximum is less efficient than taking a group-wise minimum (when a loose index scan is used).

However, with the grouping columns having sufficiently low cardinality, both queries are very fast and the difference is negligible.

To be continued

Written by Quassnoi

November 24th, 2009 at 11:00 pm

Posted in MySQL
