LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns
In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.
Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.
In SQL, there are at least three methods to implement it:
LEFT JOIN / IS NULL
SELECT o.* FROM outer o LEFT JOIN inner i ON i.value = o.value WHERE i.value IS NULL
NOT IN
SELECT o.* FROM outer o WHERE o.value NOT IN ( SELECT value FROM inner )
NOT EXISTS
SELECT o.* FROM outer o WHERE NOT EXISTS ( SELECT NULL FROM inner i WHERE i.value = o.value )
When inner.value
is marked as NOT NULL
, all these queries are semantically equivalent and with proper indexing have similarly optimized execution plans in MySQL.
Now, what if inner.value
is not nullable and does contain some NULL
values?
Let's create some sample tables:
Table creation details
There are two identical MyISAM tables. Each of the tables contains 1,000,000 random values from 1 to 99,999 and also some NULL
values. There is an index on value
in both tables.
Now, let's check the queries.
NOT EXISTS
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE NOT EXISTS ( SELECT NULL FROM t_inner i WHERE i.val = o.val )
SUM(LENGTH(stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14600 | 73 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (9.9061s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Using index |
Field or reference '20100527_anti.o.val' of SELECT #2 was resolved in SELECT #1 select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(exists(select NULL from `20100527_anti`.`t_inner` `i` where (`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`))))
The query completes in 9.9 seconds. As we can see, it is optimized to use the index on t_inner.val
and return on the first match.
LEFT JOIN / IS NULL
SELECT SUM(LENGTH(o.stuffing)), COUNT(*) FROM t_outer o LEFT JOIN t_inner i ON i.val = o.val WHERE i.id IS NULL
SUM(LENGTH(o.stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14600 | 73 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (13.5154s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | o | ALL | 1000000 | 100.00 | |||||
1 | SIMPLE | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Not exists |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`id`)
The query semantics are the same as those of NOT EXISTS
, and we even see the Not exists
optimization in the plan, however this query performs much more poorly than NOT EXISTS
: 13 seconds. Why?
MySQL documentation on EXPLAIN
states that Not exists
is used to optimize the queries similar to the one we have just run: LEFT JOIN
with IS NULL
predicate applied to a non-nullable column.
MySQL is aware that such a predicate can only be satisfied by a record resulting from a JOIN
miss (i. e. when no matching record was found in the rightmost table) and stops reading records after first index hit.
However, this optimization is implemented in a way that is far from being perfect. Despite the fact that no actual value of id
can be returned by such a query, the engine still looks up id
in the table (since it's not a part of the index). We can see it in the plan: unlike NOT EXISTS
query, there is no Using index
for t_inner
. This means that a table lookup is performed.
Even we replace id
with val
in the query, it still performs poorly:
SELECT SUM(LENGTH(o.stuffing)), COUNT(*) FROM t_outer o LEFT JOIN t_inner i ON i.val = o.val WHERE i.val IS NULL
SUM(LENGTH(o.stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14600 | 73 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (14.4997s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | o | ALL | 1000000 | 100.00 | |||||
1 | SIMPLE | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Using index |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`val`)
This time, no table lookups are made but there is no Not exists
optimization either.
Despite the fact that the join condition eliminates possibility of an actual NULL
being returned by the query and any val IS NULL
reaching the WHERE
clause is a result of a join miss, MySQL still examines all records in t_inner
, not stopping after the first hit.
This had been submitted as a bug.
Now, what about NOT IN
?
NOT IN
Unlike the previous two queries that only differ in implementation, not in semantics, NOT IN
, being applied as is, would yield the different results.
NOT EXISTS
and IS NULL
are two-state predicates, they can only return TRUE
or FALSE
. NOT IN
is a three-state predicate: it can return TRUE
, FALSE
or NULL
.
NULL
value is returned in two cases:
- When
t_outer.value
being tested isNULL
- When at least one of
t_inner.value
isNULL
This means that having but a single NULL
in t_inner
would prevent the query from returning anything.
Naive approach
Let's see what happens if we just substitute NOT IN
instead of NOT EXISTS
:
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE val NOT IN ( SELECT val FROM t_inner i )
SUM(LENGTH(stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (10.3748s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))
Since there are NULL
s in t_inner
, no record in t_outer
can satisfy the predicate.
MySQL does not optimize this very well. It takes but a single index scan to find out if there are NULL
values in t_inner
and return if they are, but for some reason MySQL still applies the condition to each record in t_outer
.
Naive approach, improved
With a little help from our side, this can be improved:
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE NOT EXISTS ( SELECT NULL FROM t_inner i WHERE val IS NULL ) AND val NOT IN ( SELECT val FROM t_inner i )
SUM(LENGTH(stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.0014s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | Impossible WHERE | ||||||||
3 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Full scan on NULL key |
2 | SUBQUERY | i | ref | ix_inner_val | ix_inner_val | 5 | 4 | 100.00 | Using where; Using index |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where 0
We added an explicit check for NULL
values. Since it's not correlated, MySQL could instantly prove it false, cache it and avoid the table scan at all.
Ignoring right side NULLs
Now, let's make a NOT IN
query that does not take the NULL
values in t_inner
into account:
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE val NOT IN ( SELECT val FROM t_inner i WHERE val IS NOT NULL )
SUM(LENGTH(stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13400 | 67 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (10.4060s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Using where; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))
This time, the query returns records, but not as many as the previous queries did.
We made an additional check for NULL
in t_inner
but not in t_outer
. There are some records in t_outer
that have a NULL
in val
. Both IN
and NOT IN
would evaluate to NULL
and WHERE
would filter them out.
We see another glitch in MySQL optimizer here: a Full scan on NULL key
applied. Since NOT IN
should always return TRUE
when the subquery returns no records (even if the value checked is a NULL
), on correlated queries a fullscan should be applied to check for the records and find out whether to return NULL
or FALSE
. However, in this case the IN
subquery is not correlated, so the check could only be performed once and cached, like with the LEFT JOIN
.
In our case the overhead would be negligible, since the subquery would return on the first match, but it could matter if we had more NULL
values in t_outer
.
Now, what if we want NULL
records on t_outer
to be returned as well? We just need to add an additional check for NULL
s.
Ignoring all NULL
s
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE val IS NULL OR val NOT IN ( SELECT val FROM t_inner i WHERE val IS NOT NULL )
SUM(LENGTH(stuffing)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14600 | 73 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (10.4842s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | ix_outer_val | 1000000 | 100.00 | Using where | |||
2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Using where; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (isnull(`20100527_anti`.`o`.`val`) or (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`))))))))
Here, the query returns the same results as NOT EXISTS
.
Full scan on NULL key
is still present in the plan but will never actually be executed because it will be short circuited by the previous IS NULL
check.
Summary
As was shown in the earlier article, LEFT JOIN / IS NULL
and NOT IN
are best used to implement an anti-join in MySQL if the columns on both sides are not nullable.
The situation is different when the columns are nullable:
NOT EXISTS
performs in most straightforward way: just checks equality and returnsTRUE
orFALSE
on the first hit / miss.LEFT JOIN / IS NULL
either makes an additional table lookup or does not return on the first match and performs more poorly in both cases.NOT IN
, having different semantics, requires additional checks forNULL
values. These checks should be coded into the query
With nullable columns, NOT EXISTS
and NOT IN
(with additional checks for NULLS
) are the most efficient methods to implement an anti-join in MySQL.
LEFT JOIN / IS NULL
performs poorly.
In my case “right join where null” work faster, then left join.
But i search this article to prove it or not.
I have multi-rightjoin on except null.
And i think how turn over query, that not used “null” by all tmp table.
Probably it will be sum of each result of tables into new table.
Sasa
31 Aug 22 at 07:12