EXPLAIN EXTENDED

How to create fast database queries

Aggregates: subqueries vs. GROUP BY


From Stack Overflow:

I have a table users and there is a field invited_by_id showing user id of the person who invited this user.

I need to make a MySQL query returning rows with all the fields from users plus a invites_count field showing how many people were invited by each user.

The task seems very simple (and in fact it is), but there are at least two ways to do it. In this article, I will discuss the benefits and drawbacks of each approach.

GROUP BY

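The first approach is using GROUP BY (the queries in this article call the inviter column invited_by, as in the sample tables used later):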

SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id

This is a GROUP BY with a self join, very simple. There are only two little things I'd like to pay some attention to.

First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN. We also use COUNT(ui.id) instead of COUNT(*): a user with no invitees still produces one joined row (with NULLs in all ui columns), so COUNT(*) would return at least 1 for every group, whereas COUNT(ui.id) skips NULLs (which here can only result from a LEFT JOIN miss).
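The difference between the two counts is easy to see on a tiny example. The sketch below assumes a throwaway table called users_demo (a hypothetical name, not one of the article's tables) with three rows, where only user 1 has invited anybody:

```sql
-- Hypothetical demo table: the name and the rows are illustrative only.
CREATE TABLE users_demo (
    id         INT NOT NULL PRIMARY KEY,
    invited_by INT NOT NULL
);

INSERT INTO users_demo (id, invited_by) VALUES (1, 0), (2, 1), (3, 1);

SELECT  u.id,
        COUNT(*)     AS cnt_star,  -- also counts the NULL-extended row
        COUNT(ui.id) AS cnt_id     -- skips rows where ui.id IS NULL
FROM    users_demo u
LEFT JOIN
        users_demo ui
ON      ui.invited_by = u.id
GROUP BY
        u.id;
```

For user 1 both counts are 2. For users 2 and 3, who invited nobody, cnt_star is 1 and cnt_id is 0, which is why COUNT(ui.id) is the one we want.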

Second, we group by u.id but use u.* in the SELECT clause. Many other engines would reject such a query, but MySQL allows selecting fields that are neither grouped by nor aggregated. These fields return an arbitrary value from one of the aggregated records (in practice, the record read first in its group).

This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it is perfectly valid for queries like ours. We don't need to clutter the GROUP BY clause with all the fields from users if we have already grouped by the PRIMARY KEY, which is unique. All other values from users are uniquely defined by the PRIMARY KEY, so it does not matter which record the query uses to return the ungrouped values: they are all the same within the group.
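For contrast, here is a sketch of the kind of query where this extension does give an unpredictable result: the grouping column is not unique, so the ungrouped column can take several values within a group. It assumes the users table from the question. Note that in MySQL 5.7.5 and later the ONLY_FULL_GROUP_BY SQL mode is enabled by default and rejects the first query below, while a query grouped by the PRIMARY KEY is still accepted.

```sql
-- id is not determined by invited_by, so the value returned
-- for it is an arbitrary one from each group.
SELECT  invited_by, id, COUNT(*) AS cnt
FROM    users
GROUP BY
        invited_by;

-- A deterministic alternative: say explicitly which id is wanted.
SELECT  invited_by, MIN(id) AS first_invited_id, COUNT(*) AS cnt
FROM    users
GROUP BY
        invited_by;
```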

Subquery

This solution involves correlated subqueries:

SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u

Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly the same records as the previous one.
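One way to check that claim on real data is to join the two result sets on id and count the rows where the counts differ. This is only a sketch; the aliases g, s, cnt and mismatches are names introduced here for illustration:

```sql
-- Counts the users for which the two approaches disagree.
SELECT  COUNT(*) AS mismatches
FROM    (
        SELECT  u.id, COUNT(ui.id) AS cnt
        FROM    users u
        LEFT JOIN
                users ui
        ON      ui.invited_by = u.id
        GROUP BY
                u.id
        ) g
JOIN    (
        SELECT  u.id,
                (
                SELECT  COUNT(*)
                FROM    users ui
                WHERE   ui.invited_by = u.id
                ) AS cnt
        FROM    users u
        ) s
ON      s.id = g.id
WHERE   s.cnt <> g.cnt;
```

If the two queries are equivalent, mismatches is 0.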

Comparison

Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:

Table creation details

There are two identical tables, one using the MyISAM storage engine and the other using InnoDB. Each table contains 100,000 records.

The field invited_by is filled with a random value from 0 (inclusive) to the current id (exclusive). 0 means that the user was not invited (and really, who could invite the first user?).
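The original creation script is not reproduced here, so the following is only a sketch of how such tables could be built. The table names users_innodb and users_myisam, the name column and the secondary index on invited_by are all assumptions, and the recursive CTE requires MySQL 8.0:

```sql
-- Hypothetical schema: names and columns are assumptions.
CREATE TABLE users_innodb (
    id         INT NOT NULL PRIMARY KEY,
    invited_by INT NOT NULL,
    name       VARCHAR(50) NOT NULL,
    KEY ix_users_invited_by (invited_by)
) ENGINE = InnoDB;

-- The default recursion limit (1000) is too low for 100,000 rows.
SET SESSION cte_max_recursion_depth = 200000;

INSERT INTO users_innodb (id, invited_by, name)
WITH RECURSIVE seq (id) AS (
    SELECT  1
    UNION ALL
    SELECT  id + 1
    FROM    seq
    WHERE   id < 100000
)
SELECT  id,
        FLOOR(RAND() * id),   -- random value in [0, id)
        CONCAT('User ', id)
FROM    seq;

-- Copy the same rows into a MyISAM table so the data is identical.
CREATE TABLE users_myisam LIKE users_innodb;
ALTER TABLE users_myisam ENGINE = MyISAM;
INSERT INTO users_myisam SELECT * FROM users_innodb;
```

Since RAND() returns a value in [0, 1), FLOOR(RAND() * id) is always less than id, and for the first user it is always 0.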

All records

We will calculate the number of members invited by each user. For the sake of brevity, we will aggregate the returned records.
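The exact aggregate used for the measurements is not shown, so the wrapper below is an assumption: it collapses the 100,000 rows into a single row by counting the users and summing the invite counts. The aliases q, cnt, users_total and invites_total are illustrative.

```sql
-- Collapse the per-user result into one row to keep the output short.
SELECT  COUNT(*) AS users_total, SUM(cnt) AS invites_total
FROM    (
        SELECT  u.id, COUNT(ui.id) AS cnt
        FROM    users u
        LEFT JOIN
                users ui
        ON      ui.invited_by = u.id
        GROUP BY
                u.id
        ) q;
```

The same wrapper can be put around the subquery version, so that both queries still do all the per-user work but return one row.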

MyISAM

GROUP BY

Subquery

We see that the subquery solution is much more efficient: 3 seconds against 14 seconds, almost 5 times faster.

What is the reason?

If we look into the query plans, we will see that MySQL uses a temporary table and a filesort for the GROUP BY query.
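The plan can be inspected by prefixing the query with EXPLAIN. This is a sketch against the users table from the question; the plan you get depends on the MySQL version, the storage engine and the indexes, so it may not match the one described in this article.

```sql
EXPLAIN
SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id;
```

In the Extra column of the output, the notes to look for are "Using temporary" and "Using filesort".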

MySQL knows only one way to aggregate the records, namely sorting. To execute any GROUP BY statement, it must first order the records according to the GROUP BY expressions. This naturally returns the records sorted by the GROUP BY expressions, and MySQL even took care to document this behavior.

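There is a PRIMARY KEY index on the id field, but traversing the index is quite slow for a MyISAM table (the index does not contain the table values, so each entry requires an extra lookup into the table), so the optimizer preferred to do a filesort.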

The subqueries, on the other hand, do not require any additional sorting, so retrieving the values aggregated in the subqueries is much faster. The records, of course, are not returned in any specific order, but that wasn't required anyway.

InnoDB

GROUP BY

Subquery

With InnoDB, both queries complete in almost the same time, but the GROUP BY query is still a little faster.

Unlike MyISAM, with InnoDB tables the optimizer chooses the index access path, which avoids GROUP BY sorting.

This is because InnoDB tables are index-organized, and the PRIMARY KEY is the table itself. So the table scan and the PRIMARY KEY scan are in fact the same thing in InnoDB and there is no point in additional sorting.

The algorithms behind the LEFT JOIN and the subqueries are in fact the same: just a single index range scan. However, due to some implementation issues, the subquery access incurs additional overhead, which makes the subquery run about 15% longer than its LEFT JOIN counterpart.

LIMIT with index

Let's try the same queries, but now just return the first 100 records.
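The exact LIMIT queries are not reproduced here, so the following is a sketch. The ORDER BY on id is an assumption, made so that "the first 100 records" is well defined and can be served from the PRIMARY KEY index:

```sql
-- GROUP BY version, first 100 users by id (ORDER BY is an assumption).
SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id
ORDER BY
        u.id
LIMIT 100;

-- Subquery version, same ordering and limit.
SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u
ORDER BY
        u.id
LIMIT 100;
```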

MyISAM

GROUP BY

Subquery

In this case, the optimizer uses the index even though the table uses MyISAM, because LIMIT makes index traversal cheaper than sorting.

However, for some strange reason, the engine still sorts the final resultset (which is already returned sorted from the GROUP BY), and the GROUP BY query is 3 times slower than the subqueries.

InnoDB

GROUP BY

Subquery

With InnoDB tables, the optimizer does no final filesort, so both solutions take almost the same time (although the subqueries are again several percent less efficient).

LIMIT without index

Now, let's run the queries that limit the resultsets ordered by an unindexed expression. Usually, this means ordering by RAND() to show, say, 10 random users.
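As a sketch of what such queries look like, here are both versions ordered by RAND() and limited to 10 rows. They assume the users table from the question; the exact queries used for the measurements are not shown.

```sql
-- GROUP BY version: 10 random users with their invite counts.
SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id
ORDER BY
        RAND()
LIMIT 10;

-- Subquery version: only ORDER BY and LIMIT, no GROUP BY.
SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u
ORDER BY
        RAND()
LIMIT 10;
```

No index can provide the RAND() order, so in both cases every user has to be read before the LIMIT can be applied.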

MyISAM

GROUP BY

Subquery

The subqueries are notably faster (2 seconds for the subqueries against 4 seconds for the GROUP BY).

This demonstrates an interesting flaw in the MySQL optimizer.

We remember that MySQL used the table scan with a sort for a query without a LIMIT, and an index scan for a query using a LIMIT.

In this case, MySQL's optimizer sees two clauses, GROUP BY and LIMIT, and there is an index available for GROUP BY id.

Taken together, they should make the index access path more efficient, and as we already saw in the previous section, they do, since normally LIMIT just takes the first records from the index.

But there is a third clause here, ORDER BY.

LIMIT, of course, gets applied to the ORDER BY sorting, not the GROUP BY one. MySQL's optimizer, however, does not take this into account. Despite the fact that the ORDER BY needs the whole recordset, the index access path is still used. This makes the GROUP BY traverse the whole index. Of course, no sorting is done (for the GROUP BY, that is), but the traversal itself is quite slow in MyISAM.

The subquery solution, on the other hand, has only the ORDER BY and LIMIT. This makes the optimizer choose the table scan, which is much faster when we need all records.

InnoDB

GROUP BY

Subquery

Formally, InnoDB is subject to the same optimizer mistake as the one described above for MyISAM. For the GROUP BY, we see the index access path in the query plan; for the subqueries, we see ALL.

But for InnoDB, these two access paths are in fact the same, since an InnoDB table is stored in its PRIMARY KEY index, and an index traversal over the PRIMARY KEY is a table traversal.

So unlike MyISAM, nothing bad happens in the first case here, and the GROUP BY query runs a little bit faster than the subqueries.

Summary

Generally speaking, aggregates over a left-joined table grouped by the main table's primary key, and the aggregate subqueries over the same table yield the same results and use almost the same algorithms in MySQL.

However, they still differ performance-wise.

On the one hand, the subqueries require some overhead and run several percent slower than the joins.

On the other hand, GROUP BY in MySQL requires sorting the joined recordset on the GROUP BY expressions.

For InnoDB, the optimizer mostly makes the optimal decisions and does not sort the recordset, since an InnoDB table is always ordered by the PRIMARY KEY and the records naturally come in that order out of the table. The optimizer is aware of that.

For MyISAM, the PRIMARY KEY index does not contain the table values. The table needs to be looked up, which requires extra work. That's why in MyISAM the optimizer often makes incorrect decisions about whether to use the index sort order or to sort the records taken from the table. These incorrect decisions lead to the subqueries being more efficient.

For MyISAM tables, the subqueries are often a better alternative to the GROUP BY.

For InnoDB tables, the subqueries and the GROUP BY complete in almost the same time, but GROUP BY is still several percent more efficient.

With InnoDB, the GROUP BY queries over a left-joined table should be preferred over running the aggregate subqueries.

Written by Quassnoi

January 30th, 2010 at 11:00 pm

Posted in MySQL

One Response to 'Aggregates: subqueries vs. GROUP BY'


  1. thanks a lot very helpful, even though i’m using postgres it gives me an idea where to start

    ali

    30 Sep 14 at 20:11
