UNION vs UNION ALL at EXPLAIN EXTENDED

UNION vs UNION ALL

Comments enabled. I *really* need your comment

In my previous article:

Ordering UNION’s

I described efficient ORDER BY along with LIMIT for UNION'ed resultsets.

One thing left unanswered: why UNION DISTINCT in some cases is faster than UNION ALL?

UNION DISTINCT (for which a mere UNION is an alias) implies selecting DISTINCT values from the resultsets, while UNION ALL just selects all values, including duplicates.

On one hand, UNION ALL does not perform duplicate checking, which should be faster (and almost always is). On the other hand, it returns more rows.

If some complex calculations are performed on these rows, this may kill all performance benefits.

Let's create the sample tables:

CREATE TABLE t_a (
        id INT NOT NULL PRIMARY KEY,
        orderer INT NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_a_orderer (orderer)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t_b (
        id INT NOT NULL PRIMARY KEY,
        orderer INT NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_b_orderer (orderer)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(200000);
COMMIT;

INSERT
INTO    t_a (id, orderer, value)
SELECT  id, RAND(20090516) * 100, CONCAT('Value A ', id)
FROM    filler;

INSERT
INTO    t_b (id, orderer, value)
SELECT  id, RAND(20090516 << 1) * 100, CONCAT('Value B ', id)
FROM    filler;
&#91;/sourcecode&#93;

We have 200,000 rows in each tables, and randomly filled <code>orderer</code> fields, which take values from <strong>0</strong> to <strong>99</strong>.

Now, let's look into two queries:


SELECT  orderer % 10
FROM    t_a
UNION ALL
SELECT  orderer % 10
FROM    t_b
ORDER BY
        1
LIMIT 10

orderer % 10
0
0
0
0
0
0
0
0
0
0
10 rows fetched in 0.0002s (0.5396s)

id	select_type	table	type	key	key_len	rows	filtered	Extra
1	PRIMARY	t_a	index	ix_a_orderer	4	200007	100.00	Using index
2	UNION	t_b	index	ix_b_orderer	4	200834	100.00	Using index
	UNION RESULT	<union1,2>	ALL					Using filesort

SELECT  orderer % 10
FROM    t_a
UNION
SELECT  orderer % 10
FROM    t_b
ORDER BY
        1
LIMIT 10

orderer % 10
0
1
2
3
4
5
6
7
8
9
10 rows fetched in 0.0002s (0.4678s)

id	select_type	table	type	key	key_len	rows	filtered	Extra
1	PRIMARY	t_a	index	ix_a_orderer	4	200007	100.00	Using index
2	UNION	t_b	index	ix_b_orderer	4	200834	100.00	Using index
	UNION RESULT	<union1,2>	ALL					Using filesort

We see that the UNION query is apparently faster than a UNION ALL one.

That's because there are no so many DISTINCT values (only 10) and eliminating the duplicates is almost free, while sorting is performed on a much smaller recordset.

The previous article described how to improve the UNION ALL query.

Now, we can improve the UNION query.

We have a deterministic expression in the SELECT clause. That means that the expression gives same results for same input values.

So if we calculate a list of distinct input values beforehand, this will not affect our recordset, because equal values will give us equal results anyway. Doing this is much easier since we have an index on orderer.

Then we just repeat the UNION to get rid of duplicates that appeared after we calculated the expressions, but this time there will be much fewer expressions to calculate (which is good by itself), and much fewer duplicates to eliminate.

SELECT  DISTINCT orderer % 10
FROM    (
        SELECT  DISTINCT orderer
        FROM    t_a
        UNION
        SELECT  DISTINCT orderer
        FROM    t_b
        ) q
ORDER BY
        1
LIMIT 10

orderer % 10
0
1
2
3
4
5
6
7
8
9
10 rows fetched in 0.0002s (0.1974s)

id	select_type	table	type	key	key_len	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL			101	100.00	Using temporary; Using filesort
2	DERIVED	t_a	index	ix_a_orderer	4	199875	100.00	Using index
3	UNION	t_b	index	ix_b_orderer	4	200046	100.00	Using index
	UNION RESULT	<union2,3>	ALL

Now, the DISTINCT's are precalculated, and this helped to reduce the query time from 0.47 to less than 0.20 seconds.

Written by Quassnoi

May 16th, 2009 at 11:00 pm

Posted in MySQL

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

EXPLAIN EXTENDED