EXPLAIN EXTENDED

How to create fast database queries

Correlated LEFT JOIN in a subquery

with one comment

Answering questions asked on the site.

Beren asks:

In my database, I store the information about the game matches.

A match can be played between two players (mostly) or, occasionally, between two teams of 2, 3 or 4 players.

I store the matches in a table like this:

Game
id player1 player2 type

, where type is ENUM(player, team).

If type is player, the ids of players are stored in the record; if type is team, those of teams are stored.

Now, the tricky part. For a given game, I need to select two lists of players from both sides, comma separated, be they single players or members of a team.

Are two separate queries required for this?

This of course can be easily done using a single query.

Let's make a sample table and see:

Table creation details

The tables contain 500,000 games, 100,000 players and 3,000 teams.

The participants in the games are distributed logarithmically: 95% of all games are single player, 95% of the remaining games are played by 2×2 teams, etc.:

players COUNT(*)
1 483335
2 16111
3 539
4 15

The model above may seem quite weird, since the single players can be assigned to the fake teams so that the joins will be always performed through the player_team table. However, it's not required and in fact is less efficient. We don't really need that extra lookup if we can find a player right away, using its id, and since the majority of games are single player, majority of queries will be too.

Let's see how we can build a list of players' names. First, let's do it for the player1 part only:

SELECT  g.id, g.name, GROUP_CONCAT(p.name SEPARATOR ', ') AS players
FROM    game g
LEFT JOIN
player_team pt
ON      g.type = 'team'
AND pt.team = g.player1
JOIN    player p
ON      p.id = CASE g.type WHEN 'player' THEN g.player1 WHEN 'team' THEN pt.player END
GROUP BY
g.id
LIMIT 50
id name players
1 Game 1 Player 88151
2 Game 2 Player 60091
3 Game 3 Player 36003
4 Game 4 Player 99741
5 Game 5 Player 90694
37 Game 37 Player 63486
38 Game 38 Player 78226, Player 82189
39 Game 39 Player 34156
48 Game 48 Player 66763
49 Game 49 Player 85215
50 Game 50 Player 25784
50 rows fetched in 0.0014s (0.0029s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE g index PRIMARY 4 50 1004398.00
1 SIMPLE pt ref PRIMARY PRIMARY 4 20101020_matches.g.player1 1 100.00 Using index
1 SIMPLE p eq_ref PRIMARY PRIMARY 4 func 1 100.00 Using where
select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,group_concat(`20101020_matches`.`p`.`name` separator ', ') AS `players` from `20101020_matches`.`game` `g` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`g`.`player1`) and (`20101020_matches`.`g`.`type` = 'team'))) join `20101020_matches`.`player` `p` where (`20101020_matches`.`p`.`id` = coalesce(`20101020_matches`.`pt`.`player`,`20101020_matches`.`g`.`player1`)) group by `20101020_matches`.`g`.`id` limit 50

We see that the query returns the lists of players correctly both for single-player games and team ones (like the one you can see at Game 38). The query works in the following way:

  • For each record from game, it determines whether the game is single player or a team match.
  • If the game is multiplayer, it joins the player_team table. The query is written so that the game type is a part of a LEFT JOIN condition which can be short-circuited by the optimizer. This means that if a game is a single player game, no actual lookup is performed against player_team: the optimizer will just optimize away this step, knowing for sure that it just needs to return a NULL.
  • Finally, the player table is joined, using CASE. The case will evaluate to either player_team.player or to game.player1 (depending on the game type).

Theoretically, it could be replaced with a mere COALESCE(player_team.player, game.player1); however, if the game type were team but the team itself were empty, this expression would incorrectly evaluate to game.player1, not regarding the game type. So it's better to leave it as is.

Now, we have an aggregate of the players' names from the one side. How do we make the same for two sides?

Since we need to calculate aggregates for two different columns, it's better to replace the JOINs with the correlated subqueries. However, there is a little problem.

Normally, a correlated subquery just replaces the ON condition with a WHERE condition. But this is with a plain JOIN, and we here have a LEFT JOIN. We cannot just replace it with WHERE: it won't match anything.

Unfortunately, MySQL has a very limited ability of correlating the subqueries. For instance, a correlated value cannot be nested more than one level deep, neither can is be used in an ON clause of a join. That's why we will need to make one extra join in our subqueries.

To rewrite the aggregate query as subqueries, we will need to join game to itselft in the subqueries and use the values of a joined table instead of the correlated values:

SELECT  g.id, g.name,
(
SELECT  GROUP_CONCAT(p.name SEPARATOR ', ') AS players
FROM    game gi
LEFT JOIN
player_team pt
ON      gi.type = 'team'
AND pt.team = gi.player1
JOIN    player p
ON      p.id = CASE gi.type WHEN 'player' THEN gi.player1 WHEN 'team' THEN pt.player END
WHERE   gi.id = g.id
),
(
SELECT  GROUP_CONCAT(p.name SEPARATOR ', ') AS players
FROM    game gi
LEFT JOIN
player_team pt
ON      gi.type = 'team'
AND pt.team = gi.player2
JOIN    player p
ON      p.id = CASE gi.type WHEN 'player' THEN gi.player2 WHEN 'team' THEN pt.player END
WHERE   gi.id = g.id
)
FROM    game g
ORDER BY
id
LIMIT 50
id name playerlist1 playerlist2
1 Game 1 Player 88151 Player 6037
2 Game 2 Player 60091 Player 54100
3 Game 3 Player 36003 Player 52387
4 Game 4 Player 99741 Player 99634
5 Game 5 Player 90694 Player 41009
37 Game 37 Player 63486 Player 3322
38 Game 38 Player 78226, Player 82189 Player 4761, Player 98354
39 Game 39 Player 34156 Player 82799
48 Game 48 Player 66763 Player 41569
49 Game 49 Player 85215 Player 6814
50 Game 50 Player 25784 Player 9361
50 rows fetched in 0.0018s (0.0050s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY g index PRIMARY 4 50 1004398.00
3 DEPENDENT SUBQUERY gi eq_ref PRIMARY PRIMARY 4 20101020_matches.g.id 1 100.00
3 DEPENDENT SUBQUERY pt ref PRIMARY PRIMARY 4 20101020_matches.gi.player2 1 100.00 Using index
3 DEPENDENT SUBQUERY p eq_ref PRIMARY PRIMARY 4 func 1 100.00 Using where
2 DEPENDENT SUBQUERY gi eq_ref PRIMARY PRIMARY 4 20101020_matches.g.id 1 100.00
2 DEPENDENT SUBQUERY pt ref PRIMARY PRIMARY 4 20101020_matches.gi.player1 1 100.00 Using index
2 DEPENDENT SUBQUERY p eq_ref PRIMARY PRIMARY 4 func 1 100.00 Using where
Field or reference '20101020_matches.g.id' of SELECT #2 was resolved in SELECT #1
Field or reference '20101020_matches.g.id' of SELECT #3 was resolved in SELECT #1
select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,(select group_concat(`20101020_matches`.`p`.`name` separator ', ') AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player1`) and (`20101020_matches`.`gi`.`type` = 'team'))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when 'player' then `20101020_matches`.`gi`.`player1` when 'team' then `20101020_matches`.`pt`.`player` end)))) AS `(
        SELECT  GROUP_CONCAT(p.name SEPARATOR ', ') AS players
        FROM    game gi
        LEFT JOIN  
                player_team pt
        ON      gi.type = 'team'
                AND pt.team = gi.player1
        JOIN    player p
        ON      `,(select group_concat(`20101020_matches`.`p`.`name` separator ', ') AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player2`) and (`20101020_matches`.`gi`.`type` = 'team'))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when 'player' then `20101020_matches`.`gi`.`player2` when 'team' then `20101020_matches`.`pt`.`player` end)))) AS `(
        SELECT  GROUP_CONCAT(p.name SEPARATOR ', ') AS players
        FROM    game gi
        LEFT JOIN
                player_team pt
        ON      gi.type = 'team'
                AND pt.team = gi.player2
        JOIN    player p
        ON      p.` from `20101020_matches`.`game` `g` order by `20101020_matches`.`g`.`id` limit 50

Instead of using correlated values g.type and g.player*, we join another instance of game (aliased as gi) on g.id and use its fields in the join instead.

This of course adds some overhead to the query, however, the PRIMARY KEY join on an already cached record is quite fast.

Let's run two queries that select player lists for all games and compare the execution times.

First, the grouping query:

SELECT  SUM(LENGTH(players))
FROM    (
SELECT  g.id, g.name, GROUP_CONCAT(p.name SEPARATOR ', ') AS players
FROM    game g
LEFT JOIN
player_team pt
ON      g.type = 'team'
AND pt.team = g.player1
JOIN    player p
ON      p.id = CASE g.type WHEN 'player' THEN g.player1 WHEN 'team' THEN pt.player END
GROUP BY
g.id
ORDER BY
NULL
) q

Show query details

And the subqueries:

SELECT  g.id, g.name,
SUM(LENGTH(
(
SELECT  GROUP_CONCAT(p.name SEPARATOR ', ') AS players
FROM    game gi
LEFT JOIN
player_team pt
ON      gi.type = 'team'
AND pt.team = gi.player1
JOIN    player p
ON      p.id = CASE gi.type WHEN 'player' THEN gi.player1 WHEN 'team' THEN pt.player END
WHERE   gi.id = g.id
)
)) AS len1,
SUM(LENGTH(
(
SELECT  GROUP_CONCAT(p.name SEPARATOR ', ') AS players
FROM    game gi
LEFT JOIN
player_team pt
ON      gi.type = 'team'
AND pt.team = gi.player2
JOIN    player p
ON      p.id = CASE gi.type WHEN 'player' THEN gi.player2 WHEN 'team' THEN pt.player END
WHERE   gi.id = g.id
)
)) AS len2
FROM    game g

Show query details

The second query is only 40% longer, but it parses and returns twice as much data.

That means that it's more efficient to run a single query to get both lists of players than to run a grouping query twice.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

October 20th, 2010 at 11:00 pm

Posted in MySQL

One Response to 'Correlated LEFT JOIN in a subquery'

Subscribe to comments with RSS

  1. Thanks. It is very helpful.

    ram

    30 Jan 12 at 09:34

Leave a Reply