Constant vs. NULL to mark missing values in OUTER JOINs
Comments enabled. I *really* need your comment
From Stack Overflow (shortened for clarity):
I have two tables:
homesandhome_photos:It's very possible that a home does not have a photo associated with the home.
In that case, the column
primary_photo_group_idwill have value 0.Otherwise,
primary_photo_group_idequals thegroup_idof the photo.How do I return a
NULLif the photo if not defined for a home?
This is of course a classic task for an OUTER JOIN:
SELECT *
FROM homes h
LEFT JOIN
photos p
ON p.home_photo_group_id = h.primary_photo_group_id
AND …
This construct, however, can be optimized, and I'll show how.
Let's create and fill the sample tables:
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE t_data (
id INT NOT NULL PRIMARY KEY,
lookup INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
CREATE TABLE t_lookup (
id INT NOT NULL PRIMARY KEY,
value INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;
INSERT
INTO t_data (id, lookup)
SELECT id, CASE WHEN RAND(20090624) < 0.9 THEN 0 ELSE FLOOR(RAND(20090624 << 1) * 100) + 1 END
FROM filler;
INSERT
INTO t_lookup (id, value)
SELECT id, id * 10
FROM filler
ORDER BY
id
LIMIT 100;
[/sourcecode]
We have two tables: <code>t_data</code> and <code>t_lookup</code>.
Table <code>t_lookup</code> has <strong>100</strong> rows with <code>id</code>'s from <strong>1</strong> to <strong>100</strong>.
<code>t_data</code> has <strong>1,000,000</strong> rows, and the values of the column <code>lookup</code> are random, with <strong>90%</strong> probability to be a <strong>0</strong>, and <strong>10%</strong> probability to have a positive value from <strong>1</strong> to <strong>100</strong>.
Thus, if we join these two tables with an <code>OUTER JOIN</code>, only <strong>10%</strong> of rows from <code>t_data</code> will correspond to rows from <code>t_lookup</code>.
Let's make a query:
SELECT SUM(value)
FROM t_data d
LEFT JOIN
t_lookup l
ON l.id = d.lookup
| SUM(value) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50149460 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (4.3593s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | d | ALL | 1000552 | 100.00 | |||||
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | 20090624_outer.d.lookup | 1 | 100.00 |
The query runs for 4.35 seconds.
Now, we have a 0 to mark a missing value.
When we do an OUTER JOIN, MySQL needs to check each occurrence of 0 against t_lookup.
Column id is a PRIMARY KEY on t_lookup, and each of these checks is very fast, just a few microseconds. But since there are about 900,000 such checks that need to be made, this becomes quite slow.
But a 0 should not be checked against t_lookup at all. The table is designed in such way that there are no 0's in t_lookup, and the check is useless.
That's why a NULL should be used to designate a missing value in a JOIN. Since an equality condition is never true on a NULL, MySQL will not even bother to check this value against t_lookup:
SELECT SUM(value)
FROM t_data d
LEFT JOIN
t_lookup l
ON l.id = NULLIF(d.lookup, 0)
| SUM(value) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50149460 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (1.1094s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | d | ALL | 1000552 | 100.00 | |||||
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 |
This query takes only 1.10 seconds: 4 times as fast as the original one.
One should always cast a missing value marker to a NULL to make an OUTER JOIN more efficient.
It would be probably more elegant to denote a missing value with a NULL in a table column itself, but this would make the index key on the column one byte longer, which would hurt performance a little.
Besides, NULLIF construct itself is quite compact and nice.
Subscribe in a reader