NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Today, we will see how Oracle copes with these queries.
And to do this, we, of course, should create sample tables:
Table t_left
contains 100,000 rows with 10,000 distinct values.
Table t_right
contains 1,000,000 rows with 10,000 distinct values.
There are 10 rows in t_left
with values not present in t_right
.
In both tables the field value
is indexed.
LEFT JOIN / IS NULL
SELECT l.value, l.id FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL
View query results and execution plan
This execution plan is quite interesting.
First, Oracle's optimizer, unlike SQL Server's one, is smart enough to see an opportunity to use ANTI JOIN
for such a query.
Since all rows from t_left should be examined, Oracle decided to use a HASH ANTI JOIN
to do this: a hash table is built over the values from t_right
, eliminating duplicates, and every row from t_left
is searched for in the hash table. Rows not found in the hash table are returned; those that are found are not.
Second, Oracle decided to optimize the query yet a little.
In the table we actually have three fields: id
(a PRIMARY KEY
), value
we are searching for, and stuffing
. The latter field is a VARCHAR(200)
and filled with asterisks. This field emulates values in actual tables that makes the tables large in size.
Since the tables in Oracle are heap organized by default, no index contains values of both id
and value
.
However, there are two indexes that contain them separately: the index on value
created explicitly, and the index on id
created implicitly to police the PRIMARY KEY
.
This indexes, even taken together, are still much smaller than the table itself. And that's why Oracle decided to retrieve the values not from the table but from the indexes joined on rowid
.
Indeed, each of the index records corresponds to a record in the table. And index record contains a rowid
: a pointer to the original record in the table. Having equal rowid
's means belonging to the same row. And rowid
is of course unique in each index, since one table record corresponds to exactly one index record.
Since we don't use other columns except id
and value
, these two indexes contain all information we need. And joining them, though takes some extra time, is still faster than scanning the whole table (which as we know is quite large and occupies a lot of data blocks).
The HASH JOIN
we see in the plan does the things described above. Here are the steps it performs:
- First, it scans one of the two indexes with
FAST FULL SCAN
. This scan methods is faster since it does not follow theB-Tree
links but instead just reads all index pages sequentially. This does not preserve the index order but we don't need it anyway. - Then it builds a hash table from the index records read on step 1 (using
ROWID
's of the records as a hash condition) - Finally, it reads the second index (again, with a
FAST FULL SCAN
) and probes it against the hash table
This results in the same rowset of (id, value)
we would obtain had we scanned the table, only the HASH JOIN
method is faster.
The query completes in 0.27 s.
NOT IN
SELECT l.id, l.value FROM t_left l WHERE value NOT IN ( SELECT value FROM t_right )
View query results and execution plan
NOT IN
is exactly what ANTI JOIN
is designed for, that's why it's no surprise to see that Oracle uses ANTI JOIN
for NOT IN
.
NOT IN
is semantically different from the LEFT JOIN / IS NULL
and NOT EXISTS
since its logic is trivalent and filtering on it will never return anything if there are NULL
values in the list.
This behaviour was explained in the first article of the series (that concerning SQL Server) and this is a reason why PostgreSQL is reluctant to use an Anti Join
for NOT IN
.
However, Oracle is able to take into account the fact that t_right.value
is declared as NOT NULL
, and, therefore, NOT IN
is semantically equivalent to LEFT JOIN / IS NULL
and NOT EXISTS
. And Oracle uses exactly same plan for NOT IN
, with an ANTI JOIN
and a HASH JOIN
to get (id, value)
for t_left
.
The query completes in 0.28 s, same as for LEFT JOIN / IS NULL
.
NOT EXISTS
SELECT l.id, l.value FROM t_left l WHERE NOT EXISTS ( SELECT value FROM t_right r WHERE r.value = l.value )
View query results and execution plan
Guess what?
Same execution plan, same results, same performance.
The query completes in 0.28 s.
Summary
Oracle's optimizer is able to see that NOT EXISTS
, NOT IN
and LEFT JOIN / IS NULL
are semantically equivalent as long as the list values are declared as NOT NULL
.
It uses same execution plan for all three methods, and they yield same results in same time.
In Oracle, it is safe to use any method of the three described above to select values from a table that are missing in another table.
However, if the values are not guaranteed to be NOT NULL
, LEFT JOIN / IS NULL
or NOT EXISTS
should be used rather than NOT IN
, since the latter will produce different results depending on whether or not there are NULL
values in the subquery resultset.
This is great stuff, thanks.
KJ
26 Jan 13 at 08:20
Awesome! They are fully considered. Thanks for sharing.
RS
31 Jan 13 at 09:48
Good explanation, thanks.
dekano
5 Feb 16 at 14:12
Excellent explanation with examples…
Shivselvamurugan S
17 Mar 16 at 14:06
Hello. I am just beginning to learn sql query optimization. Could you explain, how did you find out that rowid’s is used as hash condition but not value itself in the first example? Because I thing that there are physical rowid’s but not logycal ones and they are unique in the database. Or am I wrong? Thank you.
Alexandra
25 Feb 17 at 00:53
Alexandra, if you look at the execution plan of the first example, what Oracle joins are indexes, not tables, and there are two joins, a HASH JOIN and a HASH JOIN ANTI.
First, two indexes are joined, both indexing table t_left: one index has (id,rowid) pairs and the other has (value,rowid) pairs. These two indexes are joined on rowid.
The result, a joint-index now has (id,value,rowid) triplets, which is then anti-joined to a third index, indexing table t_right, with (value,rowid) pairs. This second join is indeed done on value.
Tamas
7 Mar 17 at 18:52
Nice explanation.
My development team introduced the following bug, seemingly easy to do, it produces the wrong results. Just worth noting when using the LEFT OUTER JOIN syntax.
SELECT l.value, l.id
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
AND r.value IS NULL
What they should have used was:-
SELECT l.value, l.id
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL
John Ryan
24 Apr 17 at 14:13