Constant vs. NULL to mark missing values in OUTER JOINs
Comments enabled. I *really* need your comment
From Stack Overflow (shortened for clarity):
I have two tables:
homes
andhome_photos
:It's very possible that a home does not have a photo associated with the home.
In that case, the column
primary_photo_group_id
will have value 0.Otherwise,
primary_photo_group_id
equals thegroup_id
of the photo.How do I return a
NULL
if the photo if not defined for a home?
This is of course a classic task for an OUTER JOIN
:
SELECT * FROM homes h LEFT JOIN photos p ON p.home_photo_group_id = h.primary_photo_group_id AND …
This construct, however, can be optimized, and I'll show how.
Let's create and fill the sample tables:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_data ( id INT NOT NULL PRIMARY KEY, lookup INT NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; CREATE TABLE t_lookup ( id INT NOT NULL PRIMARY KEY, value INT NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_data (id, lookup) SELECT id, CASE WHEN RAND(20090624) < 0.9 THEN 0 ELSE FLOOR(RAND(20090624 << 1) * 100) + 1 END FROM filler; INSERT INTO t_lookup (id, value) SELECT id, id * 10 FROM filler ORDER BY id LIMIT 100; [/sourcecode] We have two tables: <code>t_data</code> and <code>t_lookup</code>. Table <code>t_lookup</code> has <strong>100</strong> rows with <code>id</code>'s from <strong>1</strong> to <strong>100</strong>. <code>t_data</code> has <strong>1,000,000</strong> rows, and the values of the column <code>lookup</code> are random, with <strong>90%</strong> probability to be a <strong>0</strong>, and <strong>10%</strong> probability to have a positive value from <strong>1</strong> to <strong>100</strong>. Thus, if we join these two tables with an <code>OUTER JOIN</code>, only <strong>10%</strong> of rows from <code>t_data</code> will correspond to rows from <code>t_lookup</code>. Let's make a query: SELECT SUM(value) FROM t_data d LEFT JOIN t_lookup l ON l.id = d.lookup
SUM(value) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50149460 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (4.3593s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | d | ALL | 1000552 | 100.00 | |||||
1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | 20090624_outer.d.lookup | 1 | 100.00 |
The query runs for 4.35 seconds.
Now, we have a 0 to mark a missing value.
When we do an OUTER JOIN
, MySQL needs to check each occurrence of 0 against t_lookup
.
Column id
is a PRIMARY KEY
on t_lookup
, and each of these checks is very fast, just a few microseconds. But since there are about 900,000 such checks that need to be made, this becomes quite slow.
But a 0 should not be checked against t_lookup
at all. The table is designed in such way that there are no 0
's in t_lookup
, and the check is useless.
That's why a NULL
should be used to designate a missing value in a JOIN
. Since an equality condition is never true on a NULL
, MySQL will not even bother to check this value against t_lookup
:
SELECT SUM(value) FROM t_data d LEFT JOIN t_lookup l ON l.id = NULLIF(d.lookup, 0)
SUM(value) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50149460 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (1.1094s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | d | ALL | 1000552 | 100.00 | |||||
1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 |
This query takes only 1.10 seconds: 4 times as fast as the original one.
One should always cast a missing value marker to a NULL
to make an OUTER JOIN
more efficient.
It would be probably more elegant to denote a missing value with a NULL
in a table column itself, but this would make the index key on the column one byte longer, which would hurt performance a little.
Besides, NULLIF
construct itself is quite compact and nice.