Aggregates: subqueries vs. GROUP BY at EXPLAIN EXTENDED

I have a table users and there is a field invited_by_id showing user id of the person who invited this user.

I need to make a MySQL query returning rows with all the fields from users plus a invites_count field showing how many people were invited by each user.

The task seems very simple (and it is in fact), but there are at least two approaches to do it. It this article, I will discuss the benefits and the drawbacks of each approach.

GROUP BY

The first approach is using GROUP BY:

SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id

This is a GROUP BY with a self join, very simple. There are only two little things I'd like to pay some attention to.

First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and use COUNT(ui.id) instead of COUNT(*), because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, and COUNT(ui.id) skips NULLs (which can only result from a LEFT JOIN miss).

Second, we group by u.id but use u.* in the SELECT clause. Every other engine would fail in this case, but MySQL allows selecting fields that are neighter grouped by nor aggregated. These fields will return an arbitrary value from any of the aggregated records (in practice, that is the record first read in its group).

This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it's perfectly valid for the queries like ours. We don't need to clutter the GROUP BY clause with all fields from users if we have already grouped by the PRIMARY KEY which is already unique. All other values from users are uniquely defined by the PRIMARY KEY so there is no matter which arbitrary record will the query use to return ungrouped values: they are all same within the group.

Subquery

This solution involves correlated subqueries:

SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u

Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly same records as the previous one.

Comparison

Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_innodb_user (
        id INT NOT NULL PRIMARY KEY,
        invited_by INT NOT NULL,
        username VARCHAR(30) NOT NULL,
        stuffing VARCHAR(100) NOT NULL,
        KEY ix_user_invitedby (invited_by)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t_myisam_user (
        id INT NOT NULL PRIMARY KEY,
        invited_by INT NOT NULL,
        username VARCHAR(30) NOT NULL,
        stuffing VARCHAR(100) NOT NULL,
        KEY ix_user_invitedby (invited_by)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;


DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_innodb_user
SELECT  id, FLOOR(RAND(20100130) * id), CONCAT('User ', id), RPAD('', 100, '*')
FROM    filler;

INSERT
INTO    t_myisam_user
SELECT  *
FROM    t_innodb_user;
&#91;/sourcecode&#93;
</div>

There are two identical tables, one using <strong>MyISAM</strong> storage engine, the other one using <strong>InnoDB</strong>. Each table contains <strong>100,000</strong> records.

The field <code>invited_by</code> is filled with a random value from <strong>0</strong> (including) to current <code>id</code> (not including). <strong>0</strong> means that the user was not invited (and really, who could invite the first user?).

<h3>All records</h3>

We well calculate the number of members invited by each user. For the sake of brevity we will aggregate the returned records.

<h4>MyISAM</h4>

<a href="#" onclick="xcollapse('X3621');return false;"><strong>GROUP BY</strong></a>
<br />
<div id="X3621" style="display: none; ">

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_myisam_user uo
        LEFT JOIN
                t_myisam_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100000	10000000	99986
1 row fetched in 0.0004s (13.9529s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100000	100.00
2	DERIVED	uo	ALL					100000	100.00	Using temporary; Using filesort
2	DERIVED	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	2	100.00

select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id`) `q`

Subquery

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_myisam_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_myisam_user uo
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100000	10000000	99986
1 row fetched in 0.0004s (2.9531s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100000	100.00
2	DERIVED	uo	ALL					100000	100.00
3	DEPENDENT SUBQUERY	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	2	100.00	Using index

Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo`) `q`

We see that subquery solution is much more efficient: 3 seconds against 14 seconds, almost 5 times.

What is the reason?

If we look into the query plans we will see that MySQL uses temporary and filesort for the GROUP BY query.

MySQL knows only one way to aggregate the records, namely, sorting. To execute any GROUP BY statement it should first order the records according to GROUP BY conditions. This naturally returns records sorted by GROUP BY expressions and MySQL even cared to document this behavior.

There is a PRIMARY KEY index on id field, but traversing the indexes is quite slow in MySQL, so the optimizer preferred to do a filesort.

The subqueries, on the other hand, do not require any additional sorting, so retrieving the values aggregated in the subqueries is much faster. The records, of course, will not return in any specific order, but it wasn't required anyway.

InnoDB

GROUP BY

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_innodb_user uo
        LEFT JOIN
                t_innodb_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100000	10000000	99986
1 row fetched in 0.0004s (4.2812s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100000	100.00
2	DERIVED	uo	index		PRIMARY	4		100076	100.00
2	DERIVED	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	1	100.00	Using index

select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id`) `q`

Subquery

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_innodb_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_innodb_user uo
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100000	10000000	99986
1 row fetched in 0.0004s (4.9530s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100000	100.00
2	DERIVED	uo	ALL					100191	100.00
3	DEPENDENT SUBQUERY	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	1	100.00	Using index

Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo`) `q`

With InnoDB, both queries complete in almost same time, but the GROUP BY query is still a little bit faster.

Unlike MyISAM, with InnoDB tables the optimizer chooses the index access path which avoids GROUP BY sorting.

This is because InnoDB tables are index-organized, and the PRIMARY KEY is the table itself. So the table scan and the PRIMARY KEY scan are in fact the same thing in InnoDB and there is no point in additional sorting.

The algorithms behind the LEFT JOIN and the subqueries are in fact the same: just a single index range scan. However, due to some implementation issues, the subquery access requires some additional overhead which makes the subquery to run about 15% longer than its LEFT JOIN counterpart.

LIMIT with index

Let's try the same queries, but now just return the first 100 records.

MyISAM

GROUP BY

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_myisam_user uo
        LEFT JOIN
                t_myisam_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        LIMIT 100
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100	10000	811
1 row fetched in 0.0003s (0.0315s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100	100.00
2	DERIVED	uo	index		PRIMARY	4		100000	100.00	Using filesort
2	DERIVED	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	2	100.00

select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` limit 100) `q`

Subquery

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_myisam_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_myisam_user uo
        ORDER BY
                uo.id
        LIMIT 100
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100	10000	811
1 row fetched in 0.0004s (0.0103s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100	100.00
2	DERIVED	uo	index		PRIMARY	4		100	100000.00
3	DEPENDENT SUBQUERY	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	2	100.00	Using index

Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` order by `20100130_aggregate`.`uo`.`id` limit 100) `q`

In this case, optimizer makes the query to use the index even despite the fact that the table uses MyISAM, because LIMIT makes index traversal cheaper than sorting.

However, for some strange reason, the engine still sorts the final resultset (which returns already sorted from the GROUP BY), and the GROUP BY query is 3 times less efficient than the subqueries.

InnoDB

GROUP BY

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_innodb_user uo
        LEFT JOIN
                t_innodb_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        LIMIT 100
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100	10000	811
1 row fetched in 0.0004s (0.0091s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100	100.00
2	DERIVED	uo	index		PRIMARY	4		100076	100.00
2	DERIVED	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	1	100.00	Using index

select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` limit 100) `q`

Subquery

SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_innodb_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_innodb_user uo
        ORDER BY
                uo.id
        LIMIT 100
        ) q

COUNT(*)	SUM(LENGTH(stuffing))	SUM(cnt)
100	10000	811
1 row fetched in 0.0004s (0.0097s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100	100.00
2	DERIVED	uo	index		PRIMARY	4		100	100076.00
3	DEPENDENT SUBQUERY	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	1	100.00	Using index

Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` order by `20100130_aggregate`.`uo`.`id` limit 100) `q`

With InnoDB tables, optimizer makes no final filesort, so both solutions take almost the same time (albeit the subqueries are several percent less efficient again).

LIMIT without index

Now, let's run the queries that limit the resultsets ordered by an unindexed expression. Usually, this means ordering by RAND() to show, say, 10 random users.

MyISAM

GROUP BY

SELECT  uo.id, uo.username, COUNT(ui.id) AS cnt
FROM    t_myisam_user uo
LEFT JOIN
        t_myisam_user ui
ON      ui.invited_by = uo.id
GROUP BY
        uo.id
ORDER BY
        RAND(20100130)
LIMIT 10

id	username	cnt
34429	User 34429	1
7402	User 7402	1
7109	User 7109	3
99555	User 99555	0
38015	User 38015	1
33796	User 33796	0
10230	User 10230	4
87928	User 87928	0
83541	User 83541	0
68714	User 68714	0
10 rows fetched in 0.0013s (4.8593s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	uo	index		PRIMARY	4		5	2000000.00	Using temporary; Using filesort
1	SIMPLE	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	2	100.00

select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` order by rand(20100130) limit 10

Subquery

SELECT  id, username,
        (
        SELECT  COUNT(*)
        FROM    t_myisam_user ui
        WHERE   ui.invited_by = uo.id
        ) AS cnt
FROM    t_myisam_user uo
ORDER BY
        RAND(20100130)
LIMIT 10

id	username	cnt
34429	User 34429	1
7402	User 7402	1
7109	User 7109	3
99555	User 99555	0
38015	User 38015	1
33796	User 33796	0
10230	User 10230	4
87928	User 87928	0
83541	User 83541	0
68714	User 68714	0
10 rows fetched in 0.0011s (2.0000s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	uo	ALL					100000	100.00	Using temporary; Using filesort
2	DEPENDENT SUBQUERY	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	2	100.00	Using index

Field or reference '20100130_aggregate.uo.id' of SELECT #2 was resolved in SELECT #1
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` order by rand(20100130) limit 10

The subqueries are notably faster (2 seconds for the subqueries against 4 seconds for the GROUP BY).

This demonstrates an interesting flaw in MySQL optimizer algorithm.

We remember that MySQL used the table scan with a sort for a query without a LIMIT, and an index scan for a query using a LIMIT.

In this case, MySQL's optimizer sees two clauses: GROUP BY and LIMIT, and there is an index avaiable for GROUP BY id.

Taken together, they should make the index access path more efficient, and as we already saw in the previous section, they do, since normally LIMIT just takes first 10 records from the index.

But there is a third clause here, ORDER BY.

LIMIT of course gets applied to ORDER BY sorting, not GROUP BY one. MySQL's optimizer, however, does not take this into account. Despite the fact that the ORDER BY will need the whole recordset, the index access path is still used. This makes GROUP BY to traverse all index. Of course no sorting is done (for GROUP BY that is), but traversal itself is quite slow in MyISAM.

The subquery solution, on the other hand, has only the ORDER BY and LIMIT. This makes the optimizer to choose the table scan which is much faster when we need all records.

InnoDB

GROUP BY

SELECT  uo.id, uo.username, COUNT(ui.id) AS cnt
FROM    t_innodb_user uo
LEFT JOIN
        t_innodb_user ui
ON      ui.invited_by = uo.id
GROUP BY
        uo.id
ORDER BY
        RAND(20100130)
LIMIT 10

id	username	cnt
34429	User 34429	1
7402	User 7402	1
7109	User 7109	3
99555	User 99555	0
38015	User 38015	1
33796	User 33796	0
10230	User 10230	4
87928	User 87928	0
83541	User 83541	0
68714	User 68714	0
10 rows fetched in 0.0011s (3.2968s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	uo	index		PRIMARY	4		10	1000760.00	Using temporary; Using filesort
1	SIMPLE	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	1	100.00	Using index

select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` order by rand(20100130) limit 10

Subqueries

SELECT  id, username,
        (
        SELECT  COUNT(*)
        FROM    t_innodb_user ui
        WHERE   ui.invited_by = uo.id
        ) AS cnt
FROM    t_innodb_user uo
ORDER BY
        RAND(20100130)
LIMIT 10

id	username	cnt
34429	User 34429	1
7402	User 7402	1
7109	User 7109	3
99555	User 99555	0
38015	User 38015	1
33796	User 33796	0
10230	User 10230	4
87928	User 87928	0
83541	User 83541	0
68714	User 68714	0
10 rows fetched in 0.0010s (4.0155s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	uo	ALL					100076	100.00	Using temporary; Using filesort
2	DEPENDENT SUBQUERY	ui	ref	ix_user_invitedby	ix_user_invitedby	4	20100130_aggregate.uo.id	1	100.00	Using index

Field or reference '20100130_aggregate.uo.id' of SELECT #2 was resolved in SELECT #1
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` order by rand(20100130) limit 10

Formally, InnoDB is subject to the same optimizer mistake as the one described above for MyISAM. For the GROUP BY, we see the index access path in the query plan, for subqueries, we see ALL.

But for InnoDB, these two access paths are in fact the same, since an InnoDB table is a PRIMARY KEY and the index traversal over a PRIMARY KEY is a table traversal.

So unlike MyISAM, nothing bad happens in the first case here, and the GROUP BY query runs a little bit faster than the subqueries.

Summary

Generally speaking, aggregates over a left-joined table grouped by the main table's primary key, and the aggregate subqueries over the same table yield the same results and use almost the same algorithms in MySQL.

However, they still differ performance-wise.

On the one hand, the subqueries require some overhead and are executed several percents more slowly than the joins.

On the other hand, GROUP BY in MySQL requires sorting the joined recordset on the GROUP BY expressions.

For InnoDB, the optimizer mostly makes the optimal decisions and does not sort the recordset, since an InnoDB table is always ordered by the PRIMARY KEY and the records naturally come in that order out of the table. The optimizer is aware of that.

For MyISAM, the PRIMARY KEY index does not contain the table values. The table needs to be looked up which requires extra work. That's why in MyISAM the optimizer often makes incorrect decisions about whether or not use the index sort order or sort records taken from the table. These incorrect decisions lead to the subqueries being more efficient.

For MyISAM tables, the subqueries are often a better alternative to the GROUP BY.

For InnoDB tables, the subqueries and the GROUP BY complete in almost same time, but GROUP BY is still several percent more efficient.

With InnoDB, the GROUP BY queries over a left-joined table should be preferred over running the aggregate subqueries.

Written by Quassnoi

January 30th, 2010 at 11:00 pm

Posted in MySQL

One Response to 'Aggregates: subqueries vs. GROUP BY'

Subscribe to comments with RSS

thanks a lot very helpful, even though i’m using postgres it gives me an idea where to start

ali

30 Sep 14 at 20:11

EXPLAIN EXTENDED