EXPLAIN EXTENDED

How to create fast database queries

Archive for January 30th, 2010

Aggregates: subqueries vs. GROUP BY

with one comment

From Stack Overflow:

I have a table users and there is a field invited_by_id showing user id of the person who invited this user.

I need to make a MySQL query returning rows with all the fields from users plus a invites_count field showing how many people were invited by each user.

The task seems very simple (and it is in fact), but there are at least two approaches to do it. It this article, I will discuss the benefits and the drawbacks of each approach.

GROUP BY

The first approach is using GROUP BY:

SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id

This is a GROUP BY with a self join, very simple. There are only two little things I'd like to pay some attention to.

First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and use COUNT(ui.id) instead of COUNT(*), because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, and COUNT(ui.id) skips NULLs (which can only result from a LEFT JOIN miss).

Second, we group by u.id but use u.* in the SELECT clause. Every other engine would fail in this case, but MySQL allows selecting fields that are neighter grouped by nor aggregated. These fields will return an arbitrary value from any of the aggregated records (in practice, that is the record first read in its group).

This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it's perfectly valid for the queries like ours. We don't need to clutter the GROUP BY clause with all fields from users if we have already grouped by the PRIMARY KEY which is already unique. All other values from users are uniquely defined by the PRIMARY KEY so there is no matter which arbitrary record will the query use to return ungrouped values: they are all same within the group.

Subquery

This solution involves correlated subqueries:

SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u

Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly same records as the previous one.

Comparison

Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:
Read the rest of this entry »

Written by Quassnoi

January 30th, 2010 at 11:00 pm

Posted in MySQL