Archive for January 30th, 2010
Aggregates: subqueries vs. GROUP BY
From Stack Overflow:
I have a table
usersand there is a fieldinvited_by_idshowing user id of the person who invited this user.I need to make a MySQL query returning rows with all the fields from users plus a
invites_countfield showing how many people were invited by each user.
The task seems very simple (and it is in fact), but there are at least two approaches to do it. It this article, I will discuss the benefits and the drawbacks of each approach.
GROUP BY
The first approach is using GROUP BY:
SELECT u.*, COUNT(ui.id)
FROM users u
LEFT JOIN
users ui
ON ui.invited_by = u.id
GROUP BY
u.id
This is a GROUP BY with a self join, very simple. There are only two little things I'd like to pay some attention to.
First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and use COUNT(ui.id) instead of COUNT(*), because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, and COUNT(ui.id) skips NULLs (which can only result from a LEFT JOIN miss).
Second, we group by u.id but use u.* in the SELECT clause. Every other engine would fail in this case, but MySQL allows selecting fields that are neighter grouped by nor aggregated. These fields will return an arbitrary value from any of the aggregated records (in practice, that is the record first read in its group).
This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it's perfectly valid for the queries like ours. We don't need to clutter the GROUP BY clause with all fields from users if we have already grouped by the PRIMARY KEY which is already unique. All other values from users are uniquely defined by the PRIMARY KEY so there is no matter which arbitrary record will the query use to return ungrouped values: they are all same within the group.
Subquery
This solution involves correlated subqueries:
SELECT u.*,
(
SELECT COUNT(*)
FROM users ui
WHERE ui.invited_by = u.id
)
FROM users u
Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly same records as the previous one.
Comparison
Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:
Read the rest of this entry »
Subscribe in a reader