NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Let's see for PostgreSQL 8.4.
To do this we will create two sample tables:
Table t_left
contains 100,000 rows with 10,000 distinct values.
Table t_right
contains 1,000,000 rows with 10,000 distinct values.
There are 10 rows in t_left
with values not present in t_right
.
NOT IN
SELECT l.id, l.value FROM t_left l WHERE value NOT IN ( SELECT value FROM t_right r )
Query results and execution plan
NOT IN
, as we discussed earlier, is a special case due to the way it treats NULL
values in the list.
Unfortunately, PostgreSQL's optimizer cannot use the fact that t_right.value
is defined as NOT NULL
and therefore the list can return no NULL
values.
That's why for NOT IN
queries PostgreSQL uses a special access method called hashed Subplan
:
-
First, it executes the list query (using
Seq Scan
ont_right
) and hashes its results. The duplicate values are eliminated from the hash -
Then it takes each row from
t_left
(again, usingSeq Scan
) and applies a filtering condition to each value -
Filtering condition works as follows:
-
If
t_left.value
isNULL
, it returnsNULL
right away. -
If
t_left.value
is notNULL
, it is searched for in the hash table generated on step 1:-
If the value is found in the list,
TRUE
is returned. -
If the value is not found, the hash table is searched one more time to find a
NULL
value. Depending on whether or not aNULL
was found,NULL
orFALSE
is returned
-
If the value is found in the list,
-
If
This is quite an efficient method by itself, but the need to look for the values twice somewhat degrades the query performance.
The query takes 1.34 s.
One more nasty side effect is that using a subplan
is the only method PostgreSQL's optimizer can use for NOT IN
constructs.
Since PostgreSQL cannot flush a hashed subplan onto the disk, it will estimate the subquery size, and if it decides that it will not fit into work_mem
, it will resort to using a mere subplan
which will be either executed for each row from t_left
, or materialized and the rows will be searched for in a loop.
This can be very bad since the optimizer will just change the plan as t_right
will overgrow a certain limit and one day the query will just become very slow without any apparent reason.
NOT EXISTS
SELECT l.id, l.value FROM t_left l WHERE NOT EXISTS ( SELECT value FROM t_right r WHERE r.value = l.value )
Query results and execution plan
Now PostgreSQL is able to see an opportuninty for an ANTI JOIN
and does it.
The plan is quite close to that for NOT IN
: t_right
is hashed and the hash table is looked up against all rows from t_left
.
The difference is that any missing value should be only looked up once, that makes the query a little more efficient.
This query runs for 1.10 s.
The other difference is that Anti Join
supports much wider range of join methods, namely Nested Loops
and Merge
, which can be efficiently used even if t_right
grows too large to fit into the in-memory hash table.
LEFT JOIN / IS NULL
SELECT l.id, l.value FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL
Query results and execution plan
LEFT JOIN / IS NULL
uses exactly same execution plan as NOT EXISTS
does and yields same results in same time.
PostgreSQL, unlike SQL Server, can use an Anti Join
method to process a query that combines LEFT JOIN
with IS NULL
predicate on the join field from the right table.
As was discussed in the previous article, such a join can only return a row from t_left
at most once, and it will return it any time it finds no corresponding row in t_right
.
In other words, this condition is semantically equivalent to NOT EXISTS
(but not to NOT IN
), and PostgreSQL optimizes both LEFT JOIN / IS NULL
and NOT EXISTS
in the same way.
Summary
PostgreSQL treats LEFT JOIN
and NOT EXISTS
equally, using same execution plan for both of them (namely a Hash Anti Join
for the example above).
As for NOT IN
, which is semantically different since its logic is trivalent and it can return NULL
, PostgreSQL tries to take this into account and limits itself to using a filter against a subplan
(a hashed subplan
for a hashable resultset like in example above).
Since it need to search the hash table for each missing value twice (first time to find the value, second time to find a NULL
), this method is a little less efficient.
A plain subplan
, which the optimizer can resort to any time it decides the list will not fit into the memory, is very inefficient and the queries that have possibility of using it should be avoided like a plague.
That's why in PostgreSQL 8.4 one should always use LEFT JOIN / IS NULL
or NOT EXISTS
rather than NOT IN
to find the missing values.
Use NOT EXISTS or NOT IN very slow
WHen using
SELECT l.id, l.value
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL
very fast
Thanks
ramoreez
8 Nov 12 at 13:21
@ramoreez: care to back your statement with some reproducible scenario?
Quassnoi
8 Nov 12 at 13:25
Worked For me…. for bust of data thanks @ramoreez
Dumbo Pravin
24 Feb 17 at 00:27
uses:
SELECT l.*
FROM t_left l
LEFT JOIN LATERAL (select 1 as ID from t_right r where r.value = l.value limit 1)r on true
WHERE r.id IS NULL
Amir Gabdulkhakov
20 Apr 17 at 12:09