Correlated LEFT JOIN in a subquery
Answering questions asked on the site.
Beren asks:
In my database, I store the information about the game matches.
A match can be played between two players (mostly) or, occasionally, between two teams of 2, 3 or 4 players.
I store the matches in a table like this:
Game id player1 player2 type , where
type
isENUM(player, team)
.If
type
isplayer
, the ids of players are stored in the record; if type isteam
, those of teams are stored.Now, the tricky part. For a given game, I need to select two lists of players from both sides, comma separated, be they single players or members of a team.
Are two separate queries required for this?
This of course can be easily done using a single query.
Let's make a sample table and see:
Table creation details
The tables contain 500,000 games, 100,000 players and 3,000 teams.
The participants in the games are distributed logarithmically: 95% of all games are single player, 95% of the remaining games are played by 2×2 teams, etc.:
players | COUNT(*) |
---|---|
1 | 483335 |
2 | 16111 |
3 | 539 |
4 | 15 |
The model above may seem quite weird, since the single players can be assigned to the fake teams so that the joins will be always performed through the player_team
table. However, it's not required and in fact is less efficient. We don't really need that extra lookup if we can find a player right away, using its id, and since the majority of games are single player, majority of queries will be too.
Let's see how we can build a list of players' names. First, let's do it for the player1
part only:
SELECT g.id, g.name, GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game g LEFT JOIN player_team pt ON g.type = 'team' AND pt.team = g.player1 JOIN player p ON p.id = CASE g.type WHEN 'player' THEN g.player1 WHEN 'team' THEN pt.player END GROUP BY g.id LIMIT 50
id | name | players | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Game 1 | Player 88151 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | Game 2 | Player 60091 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | Game 3 | Player 36003 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | Game 4 | Player 99741 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | Game 5 | Player 90694 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
37 | Game 37 | Player 63486 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
38 | Game 38 | Player 78226, Player 82189 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
39 | Game 39 | Player 34156 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
48 | Game 48 | Player 66763 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49 | Game 49 | Player 85215 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 | Game 50 | Player 25784 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 rows fetched in 0.0014s (0.0029s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | g | index | PRIMARY | 4 | 50 | 1004398.00 | |||
1 | SIMPLE | pt | ref | PRIMARY | PRIMARY | 4 | 20101020_matches.g.player1 | 1 | 100.00 | Using index |
1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where |
select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,group_concat(`20101020_matches`.`p`.`name` separator ', ') AS `players` from `20101020_matches`.`game` `g` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`g`.`player1`) and (`20101020_matches`.`g`.`type` = 'team'))) join `20101020_matches`.`player` `p` where (`20101020_matches`.`p`.`id` = coalesce(`20101020_matches`.`pt`.`player`,`20101020_matches`.`g`.`player1`)) group by `20101020_matches`.`g`.`id` limit 50
We see that the query returns the lists of players correctly both for single-player games and team ones (like the one you can see at Game 38). The query works in the following way:
- For each record from
game
, it determines whether the game is single player or a team match. - If the game is multiplayer, it joins the
player_team
table. The query is written so that the game type is a part of aLEFT JOIN
condition which can be short-circuited by the optimizer. This means that if a game is a single player game, no actual lookup is performed againstplayer_team
: the optimizer will just optimize away this step, knowing for sure that it just needs to return aNULL
. - Finally, the
player
table is joined, usingCASE
. The case will evaluate to eitherplayer_team.player
or togame.player1
(depending on the game type).
Theoretically, it could be replaced with a mere COALESCE(player_team.player, game.player1)
; however, if the game type were team
but the team itself were empty, this expression would incorrectly evaluate to game.player1
, not regarding the game type. So it's better to leave it as is.
Now, we have an aggregate of the players' names from the one side. How do we make the same for two sides?
Since we need to calculate aggregates for two different columns, it's better to replace the JOINs
with the correlated subqueries. However, there is a little problem.
Normally, a correlated subquery just replaces the ON
condition with a WHERE
condition. But this is with a plain JOIN
, and we here have a LEFT JOIN
. We cannot just replace it with WHERE
: it won't match anything.
Unfortunately, MySQL
has a very limited ability of correlating the subqueries. For instance, a correlated value cannot be nested more than one level deep, neither can is be used in an ON
clause of a join. That's why we will need to make one extra join in our subqueries.
To rewrite the aggregate query as subqueries, we will need to join game
to itselft in the subqueries and use the values of a joined table instead of the correlated values:
SELECT g.id, g.name, ( SELECT GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game gi LEFT JOIN player_team pt ON gi.type = 'team' AND pt.team = gi.player1 JOIN player p ON p.id = CASE gi.type WHEN 'player' THEN gi.player1 WHEN 'team' THEN pt.player END WHERE gi.id = g.id ), ( SELECT GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game gi LEFT JOIN player_team pt ON gi.type = 'team' AND pt.team = gi.player2 JOIN player p ON p.id = CASE gi.type WHEN 'player' THEN gi.player2 WHEN 'team' THEN pt.player END WHERE gi.id = g.id ) FROM game g ORDER BY id LIMIT 50
id | name | playerlist1 | playerlist2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Game 1 | Player 88151 | Player 6037 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | Game 2 | Player 60091 | Player 54100 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | Game 3 | Player 36003 | Player 52387 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | Game 4 | Player 99741 | Player 99634 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | Game 5 | Player 90694 | Player 41009 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
37 | Game 37 | Player 63486 | Player 3322 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
38 | Game 38 | Player 78226, Player 82189 | Player 4761, Player 98354 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
39 | Game 39 | Player 34156 | Player 82799 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
48 | Game 48 | Player 66763 | Player 41569 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49 | Game 49 | Player 85215 | Player 6814 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 | Game 50 | Player 25784 | Player 9361 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 rows fetched in 0.0018s (0.0050s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | g | index | PRIMARY | 4 | 50 | 1004398.00 | |||
3 | DEPENDENT SUBQUERY | gi | eq_ref | PRIMARY | PRIMARY | 4 | 20101020_matches.g.id | 1 | 100.00 | |
3 | DEPENDENT SUBQUERY | pt | ref | PRIMARY | PRIMARY | 4 | 20101020_matches.gi.player2 | 1 | 100.00 | Using index |
3 | DEPENDENT SUBQUERY | p | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where |
2 | DEPENDENT SUBQUERY | gi | eq_ref | PRIMARY | PRIMARY | 4 | 20101020_matches.g.id | 1 | 100.00 | |
2 | DEPENDENT SUBQUERY | pt | ref | PRIMARY | PRIMARY | 4 | 20101020_matches.gi.player1 | 1 | 100.00 | Using index |
2 | DEPENDENT SUBQUERY | p | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where |
Field or reference '20101020_matches.g.id' of SELECT #2 was resolved in SELECT #1 Field or reference '20101020_matches.g.id' of SELECT #3 was resolved in SELECT #1 select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,(select group_concat(`20101020_matches`.`p`.`name` separator ', ') AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player1`) and (`20101020_matches`.`gi`.`type` = 'team'))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when 'player' then `20101020_matches`.`gi`.`player1` when 'team' then `20101020_matches`.`pt`.`player` end)))) AS `( SELECT GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game gi LEFT JOIN player_team pt ON gi.type = 'team' AND pt.team = gi.player1 JOIN player p ON `,(select group_concat(`20101020_matches`.`p`.`name` separator ', ') AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player2`) and (`20101020_matches`.`gi`.`type` = 'team'))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when 'player' then `20101020_matches`.`gi`.`player2` when 'team' then `20101020_matches`.`pt`.`player` end)))) AS `( SELECT GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game gi LEFT JOIN player_team pt ON gi.type = 'team' AND pt.team = gi.player2 JOIN player p ON p.` from `20101020_matches`.`game` `g` order by `20101020_matches`.`g`.`id` limit 50
Instead of using correlated values g.type
and g.player*
, we join another instance of game
(aliased as gi
) on g.id
and use its fields in the join instead.
This of course adds some overhead to the query, however, the PRIMARY KEY
join on an already cached record is quite fast.
Let's run two queries that select player lists for all games and compare the execution times.
First, the grouping query:
SELECT SUM(LENGTH(players)) FROM ( SELECT g.id, g.name, GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game g LEFT JOIN player_team pt ON g.type = 'team' AND pt.team = g.player1 JOIN player p ON p.id = CASE g.type WHEN 'player' THEN g.player1 WHEN 'team' THEN pt.player END GROUP BY g.id ORDER BY NULL ) q
And the subqueries:
SELECT g.id, g.name, SUM(LENGTH( ( SELECT GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game gi LEFT JOIN player_team pt ON gi.type = 'team' AND pt.team = gi.player1 JOIN player p ON p.id = CASE gi.type WHEN 'player' THEN gi.player1 WHEN 'team' THEN pt.player END WHERE gi.id = g.id ) )) AS len1, SUM(LENGTH( ( SELECT GROUP_CONCAT(p.name SEPARATOR ', ') AS players FROM game gi LEFT JOIN player_team pt ON gi.type = 'team' AND pt.team = gi.player2 JOIN player p ON p.id = CASE gi.type WHEN 'player' THEN gi.player2 WHEN 'team' THEN pt.player END WHERE gi.id = g.id ) )) AS len2 FROM game g
The second query is only 40% longer, but it parses and returns twice as much data.
That means that it's more efficient to run a single query to get both lists of players than to run a grouping query twice.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Thanks. It is very helpful.
ram
30 Jan 12 at 09:34