NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.
Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Differences between the methods
These methods are quite different.
First of all, LEFT JOIN / IS NULL
and NOT EXISTS
are semantically equivalent, while NOT IN
is not. These method differ in how they handle NULL
values in t_right
LEFT JOIN
is guaranteed to return every row from t_left
, and then filtering is applied to the values returned from t_right
. If for some row in t_left
there is no corresponding row in t_right
(which means no row with that exact value is present in t_right
), the row from t_left
will be returned once, and the NULL
values will be substituted instead of t_right
's actual values.
Since NULL
values can never satisfy an equality JOIN
condition, the NULL
values returned by the query are guaranteed to be substituted by the LEFT JOIN
, not fetched out of the actual t_right
's row. This means that LEFT JOIN / IS NULL
is guaranteed to return at most one row from t_left
, and these row's value
is not equal to one of those in t_right
.
The same holds for NOT EXISTS
. Since it's a predicate, not a JOIN
condition, the rows from t_left
can only be returned at most once too. EXISTS
always returns TRUE
or FALSE
and it will return TRUE
as soon as it finds only a single matching row in t_right
, or FALSE
, if it find none.
NOT EXISTS
, therefore, will return TRUE
only if no row satisfying the equality condition is found in t_right
(same as for LEFT JOIN / IS NULL
).
Note that NULL
values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL
and NOT EXISTS
will always return rows from t_left
that have value
set to NULL
, even is there are rows with value IS NULL
in t_right
.
NOT IN
, however, behaves differently.
IN
predicate (unlike EXISTS
) is trivalent, i. e. it can return TRUE
, FALSE
or NULL
:
TRUE
is returned when the non-NULL
value in question is found in the listFALSE
is returned when the non-NULL
value is not found in the list and the list does not containNULL
valuesNULL
is returned when the value isNULL
, or the non-NULL
value is not found in the list and the list contains at least oneNULL
value
IN
predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL
values on either side of the expression, returning NULL
instead.
This of course makes no difference when using the positive form of NULL
: predicates returning NULL
are filtered out by the WHERE
clause as well as those returning FALSE
.
However, NOT IN
is different, since negation of NULL
is NULL
as well.
That's why NOT IN
condition will never hold for any list with a NULL
value in it.
- If a row is found in the list,
IN
will returnTRUE
andNOT IN
, therefore, will returnFALSE
- If a row is not found in the list,
IN
will returnNULL
, andNOT IN
on its turn will also returnNULL
Both conditions will of course be filtered out by the WHERE
clause.
Let's illustrate it with two simple queries that compare (1, NULL)
in t_left with (2, NULL)
in t_right
:
WITH t_left AS ( SELECT 1 AS value UNION ALL SELECT NULL ), t_right AS ( SELECT 2 AS value UNION ALL SELECT NULL ) SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NULL | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 rows fetched in 0.0001s (0.0006s) |
This query, using NOT EXISTS
, returns both values from t_left
, since neither of them is equal to any of the values from t_right
.
WITH t_left AS ( SELECT 1 AS value UNION ALL SELECT NULL ), t_right AS ( SELECT 2 AS value UNION ALL SELECT NULL ) SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right )
value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows fetched in 0.0001s (0.0005s) |
This query, on the other hand, returns nothing. Since there is a NULL
in t_right
, NOT IN
returns NULL
rather than TRUE
if the value is not found among the defined values. Just in case.
IN
(and NOT IN
) are too chicken to say something definite about lists with NULL
unless they are completely sure that the value is there.
However, if the values in both tables are non-nullable, NULL
, all three method describe above are semantically identical.
Efficiency comparison
Let's see how efficient are these methods.
To do that, we will create two sample tables:
Table t_left
contains 100,000 rows with 10,000 distinct values.
Table t_right
contains 1,000,000 rows with 10,000 distinct values.
There are 10 rows in t_left
with values not present in t_right
.
Let's run the queries against these tables.
NOT IN
SELECT l.id, l.value FROM [20090915_anti].t_left l WHERE l.value NOT IN ( SELECT value FROM [20090915_anti].t_right r )
View query results, details and execution plan
As we can see, this query uses Merge Anti Semi Join
which is extremely efficient if there is a cheap way to obtain two ordered resultsets (like in example above). Since value
is indexed in both tables, the indexes serve as such resulsets.
Merge Join
means that the server iterates both resultsets from lower values to higher ones, keeping a pointer
to the current value and advancing it in both resultsets.
Anti Semi Join
above means that as soon as the engine meets a match in t_right
it just skips all matching values in both t_left
and t_right
. Since values from t_right
are pregrouped using Stream Aggregate
(making the right resultset 100 times as small), the values are only skipped in t_left
(10 at once).
The whole query takes as little as 0.271 s.
NOT EXISTS
SELECT l.id, l.value FROM [20090915_anti].t_left l WHERE NOT EXISTS ( SELECT NULL FROM [20090915_anti].t_right r WHERE r.value = l.value )
View query results, details and execution plan
Exactly same plan and exactly same execution time as above.
In SQL Server, NOT IN
and NOT EXISTS
are complete synonyms in terms of the query plans and execution times (as long as both columns are NOT NULL
).
LEFT JOIN / IS NULL
SELECT l.id, l.value FROM [20090915_anti].t_left l LEFT JOIN [20090915_anti].t_right r ON r.value = l.value WHERE r.value IS NULL
View query results, details and execution plan
Here, the results are the same but performance details are very different.
SQL Server's optimizer cannot discern an ANTI JOIN
in a LEFT JOIN / IS NULL
construct.
That's why it just build the complete resultset (as with a common LEFT JOIN
) and filters out the matching values.
Since we have lots of values to filter in this case (almost 10,000,000), it's a hard job to filter such a tremendous lot of values. This operation is performed using quite an efficient Hash Match
which can be and is parallelized, but filtering the values out still takes the most time.
That's why the LEFT JOIN / IS NULL
query takes 810 ms, or 3 times as much as the NOT EXISTS
/ NOT IN
query.
Summary
In SQL Server, NOT EXISTS
and NOT IN
predicates are the best way to search for missing values, as long as both columns in question are NOT NULL
. They produce the safe efficient plans with some kind of an Anti Join
.
LEFT JOIN / IS NULL
is less efficient, since it makes no attempt to skip the already matched values in the right table, returning all results and filtering them out instead.
Just a note, it looks like a typo in the line above the Summary. It should read “LEFT JOIN / IS NULL” instead of “LEFT JOIN / NOT IN”
dansan
7 Feb 13 at 18:11
Oh, thanks, fixed!
Quassnoi
7 Feb 13 at 18:16
excellent post! I learn a lot from this post. it clarifies my confusion. thanks a lot!
ames
21 Mar 13 at 22:11
thanks for the post! a question about LEFT JOIN / IS NULL vs. NOT EXISTS on nullable columns: is NOT EXISTS still more efficient in that case?
Michael
9 Apr 13 at 19:48
I was playing with queries and I found that IN,JOIN and EXISTS returns same result.Is this expected behaviour? What did I miss? My example code:
DECLARE @temp1 TABLE(id INT, name NVARCHAR(255))
INSERT INTO @temp1
SELECT 1,’NAME1′ UNION ALL
SELECT 2,’NAME2′ UNION ALL
SELECT 3,’NAME3′ UNION ALL
SELECT 4,’NAME4′ UNION ALL
SELECT NULL,’NAME5′
DECLARE @temp2 TABLE(id INT, name NVARCHAR(255))
INSERT INTO @temp2
SELECT 1,’DATA1′ UNION ALL
SELECT 2,’DATA2′ UNION ALL
SELECT 3,’DATA3′ UNION ALL
SELECT 4, NULL UNION ALL
SELECT 5, NULL UNION ALL
SELECT 6,’DATA6′
SELECT t1.* FROM @temp1 t1 JOIN @temp2 t2 ON t1.id=t2.id
SELECT t1.* FROM @temp1 t1 WHERE t1.id IN(SELECT id FROM @temp2)
SELECT t1.* FROM @temp1 t1 WHERE EXISTS(SELECT 1 FROM @temp2 t2 WHERE t1.id=t2.id)
AMEL
12 Apr 13 at 12:21
@AMEL: what did you expect to see?
Quassnoi
12 Apr 13 at 20:29
First, this post was really helpful to understand the equivalence conditions.
Now, what would be the equivalent using a JOIN of the more general NOT IN query:
1.SELECT l.*
2.FROM t_left l
3.WHERE l.value NOT IN
4. (
5. SELECT value
6. FROM t_right r
7. WHERE r.value2 = 1
8. )
9. AND l.value3 is not null
*It can be assumed that r.value2 is not nullable as well.
*l.value3 and r.value2 are not comparable (although there might exist a correlation)
*l.value3 is nullable (which I think is not a big deal because the condition is out of the NOT IN clause).
I am actually in a situation where using join would certainly be helpful. The full story (understand original SQL query) is long but to make it short : I am trying to simplify a query with abusive use of UNION between subqueries which are the same only ran on disjoint sets(I suspect), one of those is using a not in clause when all the others can be simplified with a join
I have done quite a few tests and even with the simpler version without the l.value3 it seems I can’t find the equivalent.
using relational algebra and classical logic equivalence here is the closest I got:
1.SELECT l.*
2.FROM t_left l
3.WHERE l.value NOT IN
4. (
5. SELECT value
6. FROM t_right r
7. WHERE r.value2 = 1
8. )
Should be almost equivalent to
1.SELECT l.*
2.FROM t_left l
3. LEFT JOIN t_right r on r.value = l.value
4.WHERE r.value is null
5. OR r.value2 1
I have actually 1 more row in the JOIN version (total nb of row ~35000).
I thought that it might be because some values in one of the tables was null so I checked all the tables. For the values included in the join and/or conditions none is null.
It is actually because when using NOT IN it suffice that one line has the value2 to exclude the r.value whereas when using the JOIN it suffices that one line doesn’t have the value to include them: in short with NOT IN if exists r.value2 = 1 then you exclude it. So with the join you would want when r.value is not null For Each r.value ALL r.value2 1 (this part is hard to explain: just remember that r.value can have multiple r.value2 associated to it and we want all of them to be different from 1)
So now I am stuck searching for a way to express existential and universal conditions.
I think I will just forget about factorizing the query for now.
Sorry for the long comment but I was actually trying to find the solution at the time I was writing it, so I included all the progresses I made. I just hope it can be a good start for someone more patient than me.
ckcbc
18 Apr 13 at 11:46
Very helpful post, thanks for all the detailed explanations!
I realize this is an older post, but have you thought about adding EXCEPT as another comparison? I’ve seen a good number of questions on SO looking for the difference between EXCEPT and NOT IN or NOT EXISTS. Either way, this post definitely helped me, thanks!
Stu
5 Dec 13 at 04:42
@Stu: EXCEPT is a little bit out of this league as it removes duplicates, but at the core it’s an anti-join as well, similar to NOT IN and NOT EXISTS.
Quassnoi
5 Dec 13 at 14:37
Thanks a lot for the post.
In my case, using LEFT JOIN/IS NULL returned 5 times faster than using NOT EXISTS or NOT IN, having 20000 and 4000 rows in the table1 and table2 respectively. The join was on not null columns for selecting all the rows in table1 not in table2.
Rav
10 Dec 13 at 15:48
I realize this is an old post, but hopefully the author is still reading it.
Very interesting data. How would this apply to a multi-column match between the tables (ex. Table1.Col1 = Table2.Col1 AND Table1.Col2 = Table2.Col2) to confirm a missing record from one side?
I suspect, the left join will work better for the multi-column matching.
Chris
14 Feb 14 at 20:55
many thanks for this article. it’s make clear my mind for confuse about NOT IN , NOT EXISTS, LEFT JOIN. caused today I make script php to save data, before add in table, I must check. this data there or not. if not in previous,I’ll add it.
ip checker
22 May 14 at 07:12
Have a CTE I am doing a left Join on and looking for the records I do not match on.
When I do a select * ON the CTE I Should get no Records AND I DO. BUT WHEN I DO SELECT LIKE CUST_ID, COMPANY_ID FROM CTE I GET RECORDS WHEN I SHOULD NOT.
WITH A_CONTRACT_TO AS
(
SELECT
CPRD.CPDELT,
‘T’ AS CXCHGT,
CPRD.CPUPDT AS TO_CPUPDT,
CAST(120000 + ROW_NUMBER() OVER (PARTITION BY CPRD.CPCMPY,CPRD.CPCST# ORDER BY CPRD.CPCMPY,CPRD.CPCST#) AS DECIMAL(7,0)) AS CXTIME,
CPRD.CPUSER,
CPRD.CPCMPY,
CPRD.CPCST#,
CPRD.CPSIZE,
CPRD.CPCHG,
CPRD.CPEFDT,
CPRD.CPEXDT,
CPRD.CPFLAT,
CPRD.CPRATE,
CPRD.CPDESC,
CPRD.CPSVPR,
CPRD.CPUSER AS TO_CPUSER,
CPRD.CPUPDT,
CPRD.CPUPTM,
CPRD.CPUPGM,
CPRD.CPADDT,
CPRD.CPADTM,
CPRD.CPAUSR,
CPRD.CPAPGM,
CPRD.CPBSTN,
CPRD.CPMIN$,
CPRD.CPMAX$,
CPRD.CPRENT,
CPRD.CPMSC1,
CPRD.CPMSC2,
CPRD.CPNOTE,
CPRD.CPMATL,
CPRD.CPBTYP,
CPRD.CPSLSID,
LG.CPRATE AS LOG_CPRATE
FROM
IWSE4DI.CPRD CPRD
LEFT JOIN IWSE4DI.CPRDLG LG ON CPRD.CPCMPY = LG.CPCMPY AND CPRD.CPCST# = LG.CPCST# AND CPRD.CPSIZE = LG.CPSIZE AND CPRD.CPCHG = LG.CPCHG AND CAST(CPRD.CPFLAT AS DECIMAL(7,0)) = CAST(LG.CPFLAT AS DECIMAL(7,0))
LEFT JOIN IWSE4DI.CUST AS C ON CPRD.CPCMPY = C.CCMPNY AND CPRD.CPCST# = C.CCUST#
WHERE CPRD.CPCMPY = ’75’
AND CPRD.CPFLAT ‘0.00’
AND LG.CPFLAT is NULL
–AND LG.CPFLAT = 0
AND C.CCYCLE NOT IN (‘MS’,’M’, ‘MF’, ‘MG’,’MC’,’CC’, ‘TC’)
AND CPRD.CPUPDT > ‘0’
AND CPRD.CPCST# = ’12’ –‘738′ 127887
AND CPRD.CPSIZE = ’30’
AND CPRD.CPCHG = ‘WN’
)
–SELECT * FROM A_CONTRACT_TO
SELECT
*
— A_CONTRACT_TO.CPCMPY,
— A_CONTRACT_TO.CPCST#
FROM A_CONTRACT_TO
Michael Webb
11 Sep 14 at 22:20
Very helpful article. A colleague had advised using NOT EXISTS instead of a LEFT JOIN and in one case I saw a query take 17 minutes come down to 43 seconds.
This article really helped to explain how the queries run and why there are efficiency savings.
Nick
19 Nov 14 at 20:01
Extremely helpful article. I didn’t realize the trivalent nature of NOT IN and was beating my head against a wall trying to isolate the problem. Thanks!!
Joe
16 Jan 15 at 09:43
Very interesting.
I am a little bit skeptic about it because in Oracle using not in would kill everything.
Usually all the code that was written with not in I rewrote using minus or except in sql with major improvements.
If i am searching for non existing data then i know it does not exist but if it exists this would kill performance.
As far as I remember not in is recommended only if the not in sql has a small number of data or in list.
titus
14 Apr 15 at 13:52
@titus: what exactly it is you’re skeptic about?
Quassnoi
14 Apr 15 at 21:53
Is there any indication as to whether the In/Exists methods are still better from a performance standpoint over the Join/IsNull method when more than one field needs to be checked for the comparison. So, let’s say the subquery contains a number of comparisons in the Where clause.
Al
10 Aug 15 at 20:14
@Al: SQL Server does not allow multiple columns IN, but I doubt EXISTS and LEFT JOIN plans would be any different with multiple columns.
Quassnoi
10 Aug 15 at 22:15
Hey, your article was really helpful, thanks! :)
But which way works best with extremely big tables? I have to write a program, to find the 10 outdatet rows in like 500k. It works already pretty well with the left join.
Do you think there is a way that is more effective, with better performance?
greetings michael
Michael
11 Aug 15 at 10:23
You, are a genius, and have just helped me reduce execution time on a troublesome function by about 5000%.
The side effect of this is that I can now sleep properly, and tell my shouting customer to shut up.
The side effect of that meaning I’m out tonight on the beer!
If you were in my office right now I would even risk a bear hug…(!)
Dave
3 Sep 15 at 15:15
This here is a great article on the *fastest* way to insert a record where one doesn’t already exist:
http://cc.davelozinski.com/sql/fastest-way-to-insert-new-records-where-one-doesnt-already-exist
for those speed freaks amongst us. :-)
Johnny Boy
15 Oct 15 at 08:59
Thanks for sharing the information. Help full for me
Zubair
11 Sep 16 at 13:02
Thanks for that article.
Are you sure, that in this case NOT EXISTS and NOT IN do perform equally good?
When using the opposite check, i.e. EXISTS and IN, one should always choose EXISTS because it skips the rest after finding the first match.
That said, I had expected NOT EXISTS also to be the better practice.
Of course, finding matches or excluding makes a difference, but I think i may have to test this to proove it. ;)
e82
16 Sep 16 at 16:06
@e82: SQL Server optimizes equivalent EXISTS and IN queries to exactly same plans.
Quassnoi
23 Sep 16 at 23:29
This helped me out a ton! Most other places I was looking did not explain the JOIN on multiple tables clearly enough. Thank you!
Here’s my code if anyone is curious:
SELECT CurriculumImport.CourseNum, CurriculumImport.CL, CurriculumImport.RQ, CurriculumImport.RG, CurriculumImport.CourseID,
CurriculumImport.CourseTitle, CurriculumImport.SE, CurriculumImport.SR, CurriculumImport.SRCourseID, CurriculumImport.CLDescription, CurriculumImport.SchoolYear,
CurriculumImport.SCH, CurriculumImport.RGDescription, CurriculumImport.Attribute, CurriculumImport.AttributeValue
FROM CurriculumImport
LEFT JOIN Courses
ON (CurriculumImport.CourseNum= Courses.CourseNum
AND CurriculumImport.CL = Courses.CL
AND CurriculumImport.RQ = Courses.RQ
AND CurriculumImport.RG = Courses.RG)
WHERE (Courses.CourseNum IS NULL
OR Courses.CL IS NULL
OR Courses.RQ IS NULL
OR Courses.RG IS NULL);
Andrew
5 Jul 17 at 16:33