EXPLAIN EXTENDED

How to create fast database queries

Emulating FULL OUTER JOIN in MySQL

Comments enabled. I *really* need your comment

It is widely known that MySQL lacks support for FULL OUTER JOIN.

A common solution to work around this is using a UNION ALL to union two result sets from a LEFT JOIN and a RIGHT JOIN of two tables, adding join_column IS NULL condition to the latter.

Though it works, it's quite inefficient on large tables when used with ORDER BY … LIMIT queries, as it uses a filesort.

Let's create two sample tables and OUTER JOIN them together.

The tables are quite simple: each of them contains a million of multiples of 13 and 17, respectively, thus making each of LEFT JOIN, RIGHT JOIN and INNER JOIN of these tables non-empty:

Table creation details

A commonly suggested workaround looks like this:

SELECT  t_13.value AS val13, t_17.value AS val17
FROM    t_13
LEFT JOIN
t_17
ON      t_13.value = t_17.value
UNION ALL
SELECT  t_13.value AS val13, t_17.value AS val17
FROM    t_13
RIGHT JOIN
t_17
ON      t_13.value = t_17.value
WHERE   t_13.value IS NULL
ORDER BY
COALESCE(val13, val17)
LIMIT 30

val13 val17
13
17
26
34
39
51
52
65
68
78
85
91
102
104
117
119
130
136
143
153
156
169
170
182
187
195
204
208
221 221
234
30 rows fetched in 0.0018s (17.9216s)

, and works for almost 18 seconds.

We will try to emulate it using a MERGE JOIN algorithm.

As of 5.2.0, MySQL lacks a native support of MERGE JOIN. But basically it look like this:

  1. Allocate a variable called pointer
  2. Set the pointer to the least of minimal value's of both tables
  3. Select all rows from both tables with value equal to the pointer, cross joining them. If the value equal to the pointer is not present in either table, a single NULL row is selected from that table instead
  4. Set the pointer to the least of the minimal value's greater than the current value of pointer from both tables
  5. Repeat steps 3 and 4 until step 4 yields some values from either table

Unfortunately, due to the bug #20111, MySQL will not allow to implement step 4 efficiently using subqueries.

That's why we'll need to create UDF's to select a value next to the given one:

CREATE FUNCTION fn_min_13(_value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _ret INT;
SELECT  MIN(value)
INTO    _ret
FROM    t_13
WHERE   value > _value;
RETURN _ret;
END
$$

CREATE FUNCTION fn_min_17(_value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _ret INT;
SELECT  MIN(value)
INTO    _ret
FROM    t_17
WHERE   value > _value;
RETURN _ret;
END
$$

Here's the query itself:

SELECT  o13.value AS val13, o17.value AS val17
FROM    (
SELECT  @r AS _current,
fn_min_13(@r) AS val13,
fn_min_17(@r) AS val17,
@r := (
SELECT  LEAST(COALESCE(val13, val17), COALESCE(val17, val13))
) AS nextval
FROM    (
SELECT  @r := -1
) vars,
(
SELECT  id
FROM    (
SELECT  id
FROM    t_13
LIMIT 30
) q13
UNION ALL
SELECT  id
FROM    (
SELECT  id
FROM    t_17
LIMIT 30
) q17
) d
) dn
LEFT JOIN
t_13 o13
ON      o13.value = dn.nextval
LEFT JOIN
t_17 o17
ON      o17.value = dn.nextval
LIMIT 30
val13 val17
13
17
26
34
39
51
52
65
68
78
85
91
102
104
117
119
130
136
143
153
156
169
170
182
187
195
204
208
221 221
234
30 rows fetched in 0.0007s (0.0189s)

Almost instant.

Note that we are using the tables themselves as dummy rowsets for iterating the steps.

We can't assume that either of the tables contains the number of rows we need. That's why we'll just add LIMIT 30 to each of the tables and then LIMIT 30 the whole query.

Written by Quassnoi

April 6th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply