NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL at EXPLAIN EXTENDED

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL

Which method is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

Let's see for PostgreSQL 8.4.

To do this we will create two sample tables:

Table creation details

CREATE TABLE t_left (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL,
        stuffing VARCHAR(200) NOT NULL
);
CREATE TABLE t_right (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL,
        stuffing VARCHAR(200) NOT NULL
);

CREATE INDEX ix_left_value ON t_left (value);
CREATE INDEX ix_right_value ON t_right (value);

INSERT
INTO    t_left
SELECT  s, s % 10000, RPAD('Value ' || s || ' ' , 200, '*')
FROM    generate_series(1, 100000) s;

INSERT
INTO    t_right
SELECT  s, s % 10000 + 1, RPAD('Value ' || s || ' ' , 200, '*')
FROM    generate_series(1, 1000000) s;

ANALYZE t_left;
ANALYZE t_right;

Table t_left contains 100,000 rows with 10,000 distinct values.

Table t_right contains 1,000,000 rows with 10,000 distinct values.

There are 10 rows in t_left with values not present in t_right.

NOT IN

SELECT  l.id, l.value
FROM    t_left l
WHERE   value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

Query results and execution plan

id	value
10000	0
20000	0
30000	0
40000	0
50000	0
60000	0
70000	0
80000	0
90000	0
100000	0
10 rows fetched in 0.0004s (1.3437s)

Seq Scan on t_left l  (cost=42804.00..47085.00 rows=50000 width=8)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Seq Scan on t_right r  (cost=0.00..40304.00 rows=1000000 width=4)

NOT IN, as we discussed earlier, is a special case due to the way it treats NULL values in the list.

Unfortunately, PostgreSQL's optimizer cannot use the fact that t_right.value is defined as NOT NULL and therefore the list can return no NULL values.

That's why for NOT IN queries PostgreSQL uses a special access method called hashed Subplan:

First, it executes the list query (using Seq Scan on t_right) and hashes its results. The duplicate values are eliminated from the hash
Then it takes each row from t_left (again, using Seq Scan) and applies a filtering condition to each value
Filtering condition works as follows:
- If t_left.value is NULL, it returns NULL right away.
- If t_left.value is not NULL, it is searched for in the hash table generated on step 1:
  - If the value is found in the list, TRUE is returned.
  - If the value is not found, the hash table is searched one more time to find a NULL value. Depending on whether or not a NULL was found, NULL or FALSE is returned

This is quite an efficient method by itself, but the need to look for the values twice somewhat degrades the query performance.

The query takes 1.34 s.

One more nasty side effect is that using a subplan is the only method PostgreSQL's optimizer can use for NOT IN constructs.

Since PostgreSQL cannot flush a hashed subplan onto the disk, it will estimate the subquery size, and if it decides that it will not fit into work_mem, it will resort to using a mere subplan which will be either executed for each row from t_left, or materialized and the rows will be searched for in a loop.

This can be very bad since the optimizer will just change the plan as t_right will overgrow a certain limit and one day the query will just become very slow without any apparent reason.

NOT EXISTS

SELECT  l.id, l.value
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  value
        FROM    t_right r
        WHERE   r.value = l.value
        )

Query results and execution plan

id	value
10000	0
20000	0
30000	0
40000	0
50000	0
60000	0
70000	0
80000	0
90000	0
100000	0
10 rows fetched in 0.0004s (1.1093s)

Hash Anti Join  (cost=52804.00..57960.43 rows=190 width=8)
  Hash Cond: (l.value = r.value)
  ->  Seq Scan on t_left l  (cost=0.00..4031.00 rows=100000 width=8)
  ->  Hash  (cost=40304.00..40304.00 rows=1000000 width=4)
        ->  Seq Scan on t_right r  (cost=0.00..40304.00 rows=1000000 width=4)

Now PostgreSQL is able to see an opportuninty for an ANTI JOIN and does it.

The plan is quite close to that for NOT IN: t_right is hashed and the hash table is looked up against all rows from t_left.

The difference is that any missing value should be only looked up once, that makes the query a little more efficient.

This query runs for 1.10 s.

The other difference is that Anti Join supports much wider range of join methods, namely Nested Loops and Merge, which can be efficiently used even if t_right grows too large to fit into the in-memory hash table.

LEFT JOIN / IS NULL

SELECT  l.id, l.value
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

Query results and execution plan

id	value
10000	0
20000	0
30000	0
40000	0
50000	0
60000	0
70000	0
80000	0
90000	0
100000	0
10 rows fetched in 0.0004s (1.1250s)

Hash Anti Join  (cost=52804.00..57960.43 rows=190 width=8)
  Hash Cond: (l.value = r.value)
  ->  Seq Scan on t_left l  (cost=0.00..4031.00 rows=100000 width=8)
  ->  Hash  (cost=40304.00..40304.00 rows=1000000 width=4)
        ->  Seq Scan on t_right r  (cost=0.00..40304.00 rows=1000000 width=4)

LEFT JOIN / IS NULL uses exactly same execution plan as NOT EXISTS does and yields same results in same time.

PostgreSQL, unlike SQL Server, can use an Anti Join method to process a query that combines LEFT JOIN with IS NULL predicate on the join field from the right table.

As was discussed in the previous article, such a join can only return a row from t_left at most once, and it will return it any time it finds no corresponding row in t_right.

In other words, this condition is semantically equivalent to NOT EXISTS (but not to NOT IN), and PostgreSQL optimizes both LEFT JOIN / IS NULL and NOT EXISTS in the same way.

Summary

PostgreSQL treats LEFT JOIN and NOT EXISTS equally, using same execution plan for both of them (namely a Hash Anti Join for the example above).

As for NOT IN, which is semantically different since its logic is trivalent and it can return NULL, PostgreSQL tries to take this into account and limits itself to using a filter against a subplan (a hashed subplan for a hashable resultset like in example above).

Since it need to search the hash table for each missing value twice (first time to find the value, second time to find a NULL), this method is a little less efficient.

A plain subplan, which the optimizer can resort to any time it decides the list will not fit into the memory, is very inefficient and the queries that have possibility of using it should be avoided like a plague.

That's why in PostgreSQL 8.4 one should always use LEFT JOIN / IS NULL or NOT EXISTS rather than NOT IN to find the missing values.

Written by Quassnoi

September 16th, 2009 at 11:00 pm

Posted in PostgreSQL

4 Responses to 'NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL'

Subscribe to comments with RSS

Use NOT EXISTS or NOT IN very slow

WHen using
SELECT l.id, l.value
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL

very fast

Thanks

ramoreez

8 Nov 12 at 13:21
@ramoreez: care to back your statement with some reproducible scenario?

Quassnoi

8 Nov 12 at 13:25
Worked For me…. for bust of data thanks @ramoreez

Dumbo Pravin

24 Feb 17 at 00:27
uses:

SELECT l.*
FROM t_left l
LEFT JOIN LATERAL (select 1 as ID from t_right r where r.value = l.value limit 1)r on true
WHERE r.id IS NULL

Amir Gabdulkhakov

20 Apr 17 at 12:09

EXPLAIN EXTENDED