From Stack Overflow:

I have a table users with a field invited_by_id showing the user id of the person who invited this user. I need to make a MySQL query returning rows with all the fields from users plus an invites_count field showing how many people were invited by each user.

The task seems very simple (and it is, in fact), but there are at least two approaches to it. In this article, I will discuss the benefits and the drawbacks of each approach.
GROUP BY
The first approach is using GROUP BY:
SELECT u.*, COUNT(ui.id) AS invites_count
FROM users u
LEFT JOIN users ui
ON ui.invited_by = u.id
GROUP BY u.id
This is a GROUP BY with a self join, very simple. There are only two little things I'd like to pay attention to.
First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and we use COUNT(ui.id) instead of COUNT(*) because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, while COUNT(ui.id) skips the NULLs (which can only result from a LEFT JOIN miss).
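A minimal illustration of the difference (the data here is hypothetical: assume user 42 invited nobody):

```sql
-- After the LEFT JOIN, a user with no invitees still produces one row,
-- with every ui column set to NULL. In that group:
--   COUNT(*)     counts rows            -> returns 1 (wrong)
--   COUNT(ui.id) counts non-NULL values -> returns 0 (correct)
SELECT u.id,
       COUNT(*)     AS rows_in_group,
       COUNT(ui.id) AS invites_count
FROM users u
LEFT JOIN users ui
ON ui.invited_by = u.id
WHERE u.id = 42
GROUP BY u.id
```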
Second, we group by u.id but use u.* in the SELECT clause. Most other engines would reject such a query, but MySQL allows selecting fields that are neither grouped by nor aggregated. These fields return an arbitrary value from one of the aggregated records (in practice, from the record read first in its group).

This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it is perfectly valid for queries like ours. We don't need to clutter the GROUP BY clause with all the fields from users when we have already grouped by the PRIMARY KEY, which is unique. All other values from users are uniquely determined by the PRIMARY KEY, so it does not matter which arbitrary record the query uses to return the ungrouped values: they are all the same within the group.
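For engines that do reject ungrouped fields (or for MySQL running with the ONLY_FULL_GROUP_BY SQL mode), the portable form simply lists every selected column explicitly. A sketch, assuming the four columns of the test tables defined below rather than a real users table:

```sql
-- Portable GROUP BY: every non-aggregated column in the SELECT list
-- is repeated in the GROUP BY clause. Column names here match the
-- test tables created below; adjust them for a real users table.
SELECT u.id, u.invited_by, u.username, u.stuffing,
       COUNT(ui.id) AS invites_count
FROM users u
LEFT JOIN users ui
ON ui.invited_by = u.id
GROUP BY u.id, u.invited_by, u.username, u.stuffing
```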
Subquery
This solution involves a correlated subquery:
SELECT u.*,
       (
       SELECT COUNT(*)
       FROM users ui
       WHERE ui.invited_by = u.id
       ) AS invites_count
FROM users u
Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly the same records as the previous one.
Comparison
Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:
Table creation details
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE t_innodb_user (
id INT NOT NULL PRIMARY KEY,
invited_by INT NOT NULL,
username VARCHAR(30) NOT NULL,
stuffing VARCHAR(100) NOT NULL,
KEY ix_user_invitedby (invited_by)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE t_myisam_user (
id INT NOT NULL PRIMARY KEY,
invited_by INT NOT NULL,
username VARCHAR(30) NOT NULL,
stuffing VARCHAR(100) NOT NULL,
KEY ix_user_invitedby (invited_by)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(100000);
COMMIT;
INSERT
INTO t_innodb_user
SELECT id, FLOOR(RAND(20100130) * id), CONCAT('User ', id), RPAD('', 100, '*')
FROM filler;
INSERT
INTO t_myisam_user
SELECT *
FROM t_innodb_user;
There are two identical tables, one using the MyISAM storage engine, the other using InnoDB. Each table contains 100,000 records.

The field invited_by is filled with a random value from 0 (inclusive) to the current id (exclusive). 0 means that the user was not invited (and really, who could have invited the first user?).
All records
We will calculate the number of members invited by each user. For the sake of brevity, we will aggregate the returned records.
MyISAM
<a href="#" onclick="xcollapse('X3621');return false;"><strong>GROUP BY</strong></a>
<br />
<div id="X3621" style="display: none; ">
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*, COUNT(ui.id) AS cnt
FROM t_myisam_user uo
LEFT JOIN
t_myisam_user ui
ON ui.invited_by = uo.id
GROUP BY
uo.id
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100000 | 10000000 | 99986 |

1 row fetched in 0.0004s (13.9529s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100000 | 100.00 | |
| 2 | DERIVED | uo | ALL | | | | | 100000 | 100.00 | Using temporary; Using filesort |
| 2 | DERIVED | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 2 | 100.00 | |
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id`) `q`
Subquery
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*,
(
SELECT COUNT(*)
FROM t_myisam_user ui
WHERE ui.invited_by = uo.id
) AS cnt
FROM t_myisam_user uo
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100000 | 10000000 | 99986 |

1 row fetched in 0.0004s (2.9531s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100000 | 100.00 | |
| 2 | DERIVED | uo | ALL | | | | | 100000 | 100.00 | |
| 3 | DEPENDENT SUBQUERY | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 2 | 100.00 | Using index |
Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo`) `q`
We see that the subquery solution is much more efficient: 3 seconds against 14 seconds, almost 5 times faster.

What is the reason?
If we look into the query plans, we will see that MySQL uses temporary and filesort for the GROUP BY query.
MySQL knows only one way to aggregate records: sorting. To execute any GROUP BY statement, it first orders the records according to the GROUP BY conditions. This naturally returns the records sorted by the GROUP BY expressions, and MySQL even took care to document this behavior.
There is a PRIMARY KEY index on the id field, but traversing an index is quite slow in MyISAM, so the optimizer preferred to do a filesort.

The subqueries, on the other hand, do not require any additional sorting, so retrieving the values aggregated in the subqueries is much faster. The records, of course, are not returned in any specific order, but that wasn't required anyway.
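As a side note: when the order of the grouped resultset does not matter, classic MySQL lets you say so explicitly with ORDER BY NULL, which can suppress the implicit GROUP BY sort. This is version-specific behavior, so treat it as a sketch:

```sql
-- ORDER BY NULL tells MySQL (up to 5.7) that the GROUP BY result
-- need not be sorted, allowing the optimizer to skip the filesort.
-- MySQL 8.0 dropped implicit GROUP BY sorting, so there it is a no-op.
SELECT uo.*, COUNT(ui.id) AS cnt
FROM t_myisam_user uo
LEFT JOIN t_myisam_user ui
ON ui.invited_by = uo.id
GROUP BY uo.id
ORDER BY NULL
```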
InnoDB
GROUP BY
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*, COUNT(ui.id) AS cnt
FROM t_innodb_user uo
LEFT JOIN
t_innodb_user ui
ON ui.invited_by = uo.id
GROUP BY
uo.id
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100000 | 10000000 | 99986 |

1 row fetched in 0.0004s (4.2812s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100000 | 100.00 | |
| 2 | DERIVED | uo | index | | PRIMARY | 4 | | 100076 | 100.00 | |
| 2 | DERIVED | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 1 | 100.00 | Using index |
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id`) `q`
Subquery
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*,
(
SELECT COUNT(*)
FROM t_innodb_user ui
WHERE ui.invited_by = uo.id
) AS cnt
FROM t_innodb_user uo
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100000 | 10000000 | 99986 |

1 row fetched in 0.0004s (4.9530s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100000 | 100.00 | |
| 2 | DERIVED | uo | ALL | | | | | 100191 | 100.00 | |
| 3 | DEPENDENT SUBQUERY | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 1 | 100.00 | Using index |
Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo`) `q`
With InnoDB, both queries complete in almost the same time, but the GROUP BY query is still a little faster.

Unlike with MyISAM, for InnoDB tables the optimizer chooses the index access path, which avoids the GROUP BY sorting. This is because InnoDB tables are index-organized: the PRIMARY KEY is the table itself. A table scan and a PRIMARY KEY scan are in fact the same thing in InnoDB, so there is no point in additional sorting.

The algorithms behind the LEFT JOIN and the subquery are in fact the same: just a single index range scan. However, due to some implementation issues, the subquery access requires additional overhead, which makes the subquery run about 15% longer than its LEFT JOIN counterpart.
LIMIT with index
Let's try the same queries, but now return only the first 100 records.
MyISAM
GROUP BY
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*, COUNT(ui.id) AS cnt
FROM t_myisam_user uo
LEFT JOIN
t_myisam_user ui
ON ui.invited_by = uo.id
GROUP BY
uo.id
LIMIT 100
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100 | 10000 | 811 |

1 row fetched in 0.0003s (0.0315s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100 | 100.00 | |
| 2 | DERIVED | uo | index | | PRIMARY | 4 | | 100000 | 100.00 | Using filesort |
| 2 | DERIVED | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 2 | 100.00 | |
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` limit 100) `q`
Subquery
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*,
(
SELECT COUNT(*)
FROM t_myisam_user ui
WHERE ui.invited_by = uo.id
) AS cnt
FROM t_myisam_user uo
ORDER BY
uo.id
LIMIT 100
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100 | 10000 | 811 |

1 row fetched in 0.0004s (0.0103s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100 | 100.00 | |
| 2 | DERIVED | uo | index | | PRIMARY | 4 | | 100 | 100000.00 | |
| 3 | DEPENDENT SUBQUERY | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 2 | 100.00 | Using index |
Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` order by `20100130_aggregate`.`uo`.`id` limit 100) `q`
In this case, the optimizer makes the query use the index even though the table is MyISAM, because with a LIMIT, index traversal is cheaper than sorting.

However, for some strange reason, the engine still sorts the final resultset (which already comes out of the GROUP BY sorted), and the GROUP BY query is 3 times less efficient than the subquery one.
InnoDB
GROUP BY
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*, COUNT(ui.id) AS cnt
FROM t_innodb_user uo
LEFT JOIN
t_innodb_user ui
ON ui.invited_by = uo.id
GROUP BY
uo.id
LIMIT 100
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100 | 10000 | 811 |

1 row fetched in 0.0004s (0.0091s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100 | 100.00 | |
| 2 | DERIVED | uo | index | | PRIMARY | 4 | | 100076 | 100.00 | |
| 2 | DERIVED | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 1 | 100.00 | Using index |
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` limit 100) `q`
Subquery
SELECT COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM (
SELECT uo.*,
(
SELECT COUNT(*)
FROM t_innodb_user ui
WHERE ui.invited_by = uo.id
) AS cnt
FROM t_innodb_user uo
ORDER BY
uo.id
LIMIT 100
) q
| COUNT(*) | SUM(LENGTH(stuffing)) | SUM(cnt) |
|----------|-----------------------|----------|
| 100 | 10000 | 811 |

1 row fetched in 0.0004s (0.0097s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | <derived2> | ALL | | | | | 100 | 100.00 | |
| 2 | DERIVED | uo | index | | PRIMARY | 4 | | 100 | 100076.00 | |
| 3 | DEPENDENT SUBQUERY | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 1 | 100.00 | Using index |
Field or reference '20100130_aggregate.uo.id' of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` order by `20100130_aggregate`.`uo`.`id` limit 100) `q`
With InnoDB tables, the optimizer does no final filesort, so both solutions take almost the same time (though the subquery is again a few percent less efficient).
LIMIT without index
Now let's run the queries limiting a resultset that is ordered by an unindexed expression. Usually this means ordering by RAND() to show, say, 10 random users.
MyISAM
GROUP BY
SELECT uo.id, uo.username, COUNT(ui.id) AS cnt
FROM t_myisam_user uo
LEFT JOIN
t_myisam_user ui
ON ui.invited_by = uo.id
GROUP BY
uo.id
ORDER BY
RAND(20100130)
LIMIT 10
| id | username | cnt |
|-------|------------|-----|
| 34429 | User 34429 | 1 |
| 7402 | User 7402 | 1 |
| 7109 | User 7109 | 3 |
| 99555 | User 99555 | 0 |
| 38015 | User 38015 | 1 |
| 33796 | User 33796 | 0 |
| 10230 | User 10230 | 4 |
| 87928 | User 87928 | 0 |
| 83541 | User 83541 | 0 |
| 68714 | User 68714 | 0 |

10 rows fetched in 0.0013s (4.8593s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | SIMPLE | uo | index | | PRIMARY | 4 | | 5 | 2000000.00 | Using temporary; Using filesort |
| 1 | SIMPLE | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 2 | 100.00 | |
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` order by rand(20100130) limit 10
Subquery
SELECT id, username,
(
SELECT COUNT(*)
FROM t_myisam_user ui
WHERE ui.invited_by = uo.id
) AS cnt
FROM t_myisam_user uo
ORDER BY
RAND(20100130)
LIMIT 10
| id | username | cnt |
|-------|------------|-----|
| 34429 | User 34429 | 1 |
| 7402 | User 7402 | 1 |
| 7109 | User 7109 | 3 |
| 99555 | User 99555 | 0 |
| 38015 | User 38015 | 1 |
| 33796 | User 33796 | 0 |
| 10230 | User 10230 | 4 |
| 87928 | User 87928 | 0 |
| 83541 | User 83541 | 0 |
| 68714 | User 68714 | 0 |

10 rows fetched in 0.0011s (2.0000s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | uo | ALL | | | | | 100000 | 100.00 | Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 2 | 100.00 | Using index |
Field or reference '20100130_aggregate.uo.id' of SELECT #2 was resolved in SELECT #1
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` order by rand(20100130) limit 10
The subqueries are notably faster: 2 seconds for the subqueries against 4 seconds for the GROUP BY.

This demonstrates an interesting flaw in the MySQL optimizer's algorithm.

We remember that MySQL used a table scan with a sort for the query without a LIMIT, and an index scan for the query with a LIMIT.

In this case, MySQL's optimizer sees two clauses, GROUP BY and LIMIT, and there is an index available for GROUP BY id. Taken together, they should make the index access path more efficient, and as we saw in the previous section, they do, since normally the LIMIT just takes the first 10 records from the index.

But there is a third clause here: ORDER BY. The LIMIT, of course, applies to the ORDER BY sorting, not to the GROUP BY one. MySQL's optimizer, however, does not take this into account: despite the fact that the ORDER BY will need the whole recordset, the index access path is still used, which makes the GROUP BY traverse the whole index. No sorting is done (for the GROUP BY, that is), but the traversal itself is quite slow in MyISAM.

The subquery solution, on the other hand, has only the ORDER BY and the LIMIT, which makes the optimizer choose the table scan, and a table scan is much faster when all records are needed.
InnoDB
GROUP BY
SELECT uo.id, uo.username, COUNT(ui.id) AS cnt
FROM t_innodb_user uo
LEFT JOIN
t_innodb_user ui
ON ui.invited_by = uo.id
GROUP BY
uo.id
ORDER BY
RAND(20100130)
LIMIT 10
| id | username | cnt |
|-------|------------|-----|
| 34429 | User 34429 | 1 |
| 7402 | User 7402 | 1 |
| 7109 | User 7109 | 3 |
| 99555 | User 99555 | 0 |
| 38015 | User 38015 | 1 |
| 33796 | User 33796 | 0 |
| 10230 | User 10230 | 4 |
| 87928 | User 87928 | 0 |
| 83541 | User 83541 | 0 |
| 68714 | User 68714 | 0 |

10 rows fetched in 0.0011s (3.2968s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | SIMPLE | uo | index | | PRIMARY | 4 | | 10 | 1000760.00 | Using temporary; Using filesort |
| 1 | SIMPLE | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 1 | 100.00 | Using index |
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` order by rand(20100130) limit 10
Subquery
SELECT id, username,
(
SELECT COUNT(*)
FROM t_innodb_user ui
WHERE ui.invited_by = uo.id
) AS cnt
FROM t_innodb_user uo
ORDER BY
RAND(20100130)
LIMIT 10
| id | username | cnt |
|-------|------------|-----|
| 34429 | User 34429 | 1 |
| 7402 | User 7402 | 1 |
| 7109 | User 7109 | 3 |
| 99555 | User 99555 | 0 |
| 38015 | User 38015 | 1 |
| 33796 | User 33796 | 0 |
| 10230 | User 10230 | 4 |
| 87928 | User 87928 | 0 |
| 83541 | User 83541 | 0 |
| 68714 | User 68714 | 0 |

10 rows fetched in 0.0010s (4.0155s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|----------|-------|
| 1 | PRIMARY | uo | ALL | | | | | 100076 | 100.00 | Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | ui | ref | ix_user_invitedby | ix_user_invitedby | 4 | 20100130_aggregate.uo.id | 1 | 100.00 | Using index |
Field or reference '20100130_aggregate.uo.id' of SELECT #2 was resolved in SELECT #1
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` order by rand(20100130) limit 10
Formally, InnoDB is subject to the same optimizer mistake as the one described above for MyISAM: for the GROUP BY we see the index access path in the query plan, while for the subquery we see ALL.

But for InnoDB these two access paths are in fact the same, since an InnoDB table is its PRIMARY KEY, and an index traversal over the PRIMARY KEY is a table traversal. So, unlike with MyISAM, nothing bad happens in the first case, and the GROUP BY query runs a little faster than the subquery one.
Summary
Generally speaking, aggregates over a left-joined table grouped by the main table's primary key, and aggregate subqueries over the same table, yield the same results and use almost the same algorithms in MySQL. However, they still differ performance-wise.

On the one hand, the subqueries require some overhead and execute several percent more slowly than the joins. On the other hand, GROUP BY in MySQL requires sorting the joined recordset on the GROUP BY expressions.

For InnoDB, the optimizer mostly makes the optimal decision and does not sort the recordset, since an InnoDB table is always ordered by the PRIMARY KEY and the records naturally come out of the table in that order. The optimizer is aware of that.

For MyISAM, the PRIMARY KEY index does not contain the table values: the table needs to be looked up, which requires extra work. That's why with MyISAM the optimizer often makes incorrect decisions about whether to use the index sort order or to sort the records taken from the table, and these incorrect decisions make the subqueries more efficient.

For MyISAM tables, the subqueries are therefore often a better alternative to GROUP BY.

For InnoDB tables, the subqueries and the GROUP BY complete in almost the same time, but GROUP BY is still several percent more efficient. With InnoDB, GROUP BY queries over a left-joined table should be preferred to the aggregate subqueries.