EXPLAIN EXTENDED

How to create fast database queries

UNION vs UNION ALL

Comments enabled. I *really* need your comment

In my previous article:

I described efficient ORDER BY along with LIMIT for UNION'ed resultsets.

One thing left unanswered: why UNION DISTINCT in some cases is faster than UNION ALL?

UNION DISTINCT (for which a mere UNION is an alias) implies selecting DISTINCT values from the resultsets, while UNION ALL just selects all values, including duplicates.

On one hand, UNION ALL does not perform duplicate checking, which should be faster (and almost always is). On the other hand, it returns more rows.

If some complex calculations are performed on these rows, this may kill all performance benefits.

Let's create the sample tables:

CREATE TABLE t_a (
        id INT NOT NULL PRIMARY KEY,
        orderer INT NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_a_orderer (orderer)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t_b (
        id INT NOT NULL PRIMARY KEY,
        orderer INT NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_b_orderer (orderer)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(200000);
COMMIT;

INSERT
INTO    t_a (id, orderer, value)
SELECT  id, RAND(20090516) * 100, CONCAT('Value A ', id)
FROM    filler;

INSERT
INTO    t_b (id, orderer, value)
SELECT  id, RAND(20090516 << 1) * 100, CONCAT('Value B ', id)
FROM    filler;
&#91;/sourcecode&#93;

We have 200,000 rows in each tables, and randomly filled <code>orderer</code> fields, which take values from <strong>0</strong> to <strong>99</strong>.

Now, let's look into two queries:


SELECT  orderer % 10
FROM    t_a
UNION ALL
SELECT  orderer % 10
FROM    t_b
ORDER BY
        1
LIMIT 10

orderer % 10
0
0
0
0
0
0
0
0
0
0
10 rows fetched in 0.0002s (0.5396s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t_a index ix_a_orderer 4 200007 100.00 Using index
2 UNION t_b index ix_b_orderer 4 200834 100.00 Using index
UNION RESULT <union1,2> ALL Using filesort
SELECT  orderer % 10
FROM    t_a
UNION
SELECT  orderer % 10
FROM    t_b
ORDER BY
        1
LIMIT 10

orderer % 10
0
1
2
3
4
5
6
7
8
9
10 rows fetched in 0.0002s (0.4678s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t_a index ix_a_orderer 4 200007 100.00 Using index
2 UNION t_b index ix_b_orderer 4 200834 100.00 Using index
UNION RESULT <union1,2> ALL Using filesort

We see that the UNION query is apparently faster than a UNION ALL one.

That's because there are no so many DISTINCT values (only 10) and eliminating the duplicates is almost free, while sorting is performed on a much smaller recordset.

The previous article described how to improve the UNION ALL query.

Now, we can improve the UNION query.

We have a deterministic expression in the SELECT clause. That means that the expression gives same results for same input values.

So if we calculate a list of distinct input values beforehand, this will not affect our recordset, because equal values will give us equal results anyway. Doing this is much easier since we have an index on orderer.

Then we just repeat the UNION to get rid of duplicates that appeared after we calculated the expressions, but this time there will be much fewer expressions to calculate (which is good by itself), and much fewer duplicates to eliminate.

SELECT  DISTINCT orderer % 10
FROM    (
        SELECT  DISTINCT orderer
        FROM    t_a
        UNION
        SELECT  DISTINCT orderer
        FROM    t_b
        ) q
ORDER BY
        1
LIMIT 10
orderer % 10
0
1
2
3
4
5
6
7
8
9
10 rows fetched in 0.0002s (0.1974s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 101 100.00 Using temporary; Using filesort
2 DERIVED t_a index ix_a_orderer 4 199875 100.00 Using index
3 UNION t_b index ix_b_orderer 4 200046 100.00 Using index
UNION RESULT <union2,3> ALL

Now, the DISTINCT's are precalculated, and this helped to reduce the query time from 0.47 to less than 0.20 seconds.

Written by Quassnoi

May 16th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply