Constant vs. NULL to mark missing values in OUTER JOINs at EXPLAIN EXTENDED

Constant vs. NULL to mark missing values in OUTER JOINs

Comments enabled. I *really* need your comment

From Stack Overflow (shortened for clarity):

I have two tables: homes and home_photos:

It's very possible that a home does not have a photo associated with the home.

In that case, the column primary_photo_group_id will have value 0.

Otherwise, primary_photo_group_id equals the group_id of the photo.

How do I return a NULL if the photo if not defined for a home?

This is of course a classic task for an OUTER JOIN:

SELECT  *
FROM    homes h
LEFT JOIN
        photos p
ON      p.home_photo_group_id = h.primary_photo_group_id
        AND …

This construct, however, can be optimized, and I'll show how.

Let's create and fill the sample tables:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_data (
        id INT NOT NULL PRIMARY KEY,
        lookup INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_lookup (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_data (id, lookup)
SELECT  id, CASE WHEN RAND(20090624) < 0.9 THEN 0 ELSE FLOOR(RAND(20090624 << 1) * 100) + 1 END
FROM    filler;

INSERT
INTO    t_lookup (id, value)
SELECT  id, id * 10
FROM    filler
ORDER BY
        id
LIMIT 100;
&#91;/sourcecode&#93;

We have two tables: <code>t_data</code> and <code>t_lookup</code>.

Table <code>t_lookup</code> has <strong>100</strong> rows with <code>id</code>'s from <strong>1</strong> to <strong>100</strong>.

<code>t_data</code> has <strong>1,000,000</strong> rows, and the values of the column <code>lookup</code> are random, with <strong>90%</strong> probability to be a <strong>0</strong>, and <strong>10%</strong> probability to have a positive value from <strong>1</strong> to <strong>100</strong>.

Thus, if we join these two tables with an <code>OUTER JOIN</code>, only <strong>10%</strong> of rows from <code>t_data</code> will correspond to rows from <code>t_lookup</code>.

Let's make a query:


SELECT  SUM(value)
FROM    t_data d
LEFT JOIN
        t_lookup l
ON      l.id = d.lookup

SUM(value)
50149460
1 row fetched in 0.0001s (4.3593s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	d	ALL					1000552	100.00
1	SIMPLE	l	eq_ref	PRIMARY	PRIMARY	4	20090624_outer.d.lookup	1	100.00

The query runs for 4.35 seconds.

Now, we have a 0 to mark a missing value.

When we do an OUTER JOIN, MySQL needs to check each occurrence of 0 against t_lookup.

Column id is a PRIMARY KEY on t_lookup, and each of these checks is very fast, just a few microseconds. But since there are about 900,000 such checks that need to be made, this becomes quite slow.

But a 0 should not be checked against t_lookup at all. The table is designed in such way that there are no 0's in t_lookup, and the check is useless.

That's why a NULL should be used to designate a missing value in a JOIN. Since an equality condition is never true on a NULL, MySQL will not even bother to check this value against t_lookup:

SELECT  SUM(value)
FROM    t_data d
LEFT JOIN
        t_lookup l
ON      l.id = NULLIF(d.lookup, 0)

SUM(value)
50149460
1 row fetched in 0.0001s (1.1094s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	d	ALL					1000552	100.00
1	SIMPLE	l	eq_ref	PRIMARY	PRIMARY	4	func	1	100.00

This query takes only 1.10 seconds: 4 times as fast as the original one.

One should always cast a missing value marker to a NULL to make an OUTER JOIN more efficient.

It would be probably more elegant to denote a missing value with a NULL in a table column itself, but this would make the index key on the column one byte longer, which would hurt performance a little.

Besides, NULLIF construct itself is quite compact and nice.

Written by Quassnoi

June 24th, 2009 at 11:00 pm

Posted in MySQL

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

EXPLAIN EXTENDED