IN vs. JOIN vs. EXISTS: Oracle
Answering questions asked on the site.
Jason asks:
I have a co-worker who swears that Oracle
IN
queries are slow and refuses to use them. For example:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.
So, in general is an
IN
query very expensive? I'm having trouble putting together many non-trivial queries to run onEXPLAIN PLAN
.
Does the IN
predicate always have inferior efficiency compared to it's counterparts, EXISTS
and JOIN
?
Let's check.
First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:
IN
predicate:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )
EXISTS
predicate:SELECT foo FROM bar WHERE EXISTS ( SELECT NULL FROM asdf WHERE asdf.stuff = bar.stuff )
JOIN
/DISTINCT
:SELECT b.foo FROM ( SELECT DISTINCT stuff FROM asdf ) a JOIN bar b ON b.stuff = a.stuff
All these queries are semantically the same.
Common wisdom advices against using IN
predicates because it just doesn't look a right thing to do.
This is because less experienced SQL developers tend to translate the SQL statements into pseudocode just like they see it, which in this case looks something like this:
foreach ($bar as $bar_record) { foreach ($asdf as $asdf_record) { if ($bar_record->stuff == $asdf_record->stuff) output ($bar_record); break; } }
This just radiates inefficiency, since the inner rowset should be iterated for each row returned.
EXISTS
looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.
Finally, the JOIN
looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an IN
? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!
People, come on. Oracle developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an IN
construct.
Now, let's make two sample tables and see how it's done:
Table creation details
Table t_outer
contains 10 records.
Table t_inner
contains 1,000,000 records. The field uvalue
has a UNIQUE
index defined on it, while the field pvalue
has no index at all, and contains duplicates (each value from 1 to 500,000 is stored twice).
Now, let's make a couple of queries.
IN on UNIQUE column
SELECT id, value FROM t_outer WHERE value IN ( SELECT uvalue FROM t_inner )
This query returns all 10 values from the t_outer
instantly.
If we look into the query plan we will see that this is just a plain NESTED LOOPS
join on the index. No whole subquery reevaluation, the index is used and used efficiently.
Oracle is smart enough to make three logical constructs:
IN
is equivalent to aJOIN
/DISTINCT
DISTINCT
on a column marked asUNIQUE
andNOT NULL
is redundant, so theIN
is equivalent to a simpleJOIN
IN
is equivalent to a simpleJOIN
so any valid join method and the access methods can be used
This makes this query super fast.
JOIN / DISTINCT on UNIQUE column
SELECT id, value FROM t_outer o JOIN ( SELECT DISTINCT uvalue FROM t_inner ) ON o.value = uvalue
Exactly same query plan. From Oracle's point of view, these queries are identical. Oracle just sees that they are the same and can avoid the whole query evaluation at all: it just uses index access.
Now let's see how smart Oracle will be when it will have to search for a non-indexed column which, in addition, contains duplicates.
JOIN / DISTINCT on a non-indexed column
SELECT id, value FROM t_outer o JOIN ( SELECT DISTINCT pvalue FROM t_inner ) ON o.value = pvalue
Now the query takes quite a significant time, more than 2 seconds.
Oracle decided to make the join first and then get rid of the duplicates, despite the fact that the DISTINCT
clause is inside the inline view, not outside. However, semantically this is the same and Oracle knows it.
Now, how will the IN
predicate work? Will it be so smart to convert the query to the JOIN
which completes in 2 seconds or it will be running the inner query over and over and over again?
Let's see.
IN on a non-indexed column
SELECT id, value FROM t_outer WHERE value IN ( SELECT pvalue FROM t_inner )
1 millisecond.
The plan says TABLE ACCESS FULL
on both tables, and the t_inner
is very large, there are 1,000,000 rows there and they are long rows.
It just cannot be scanned in 1 ms. However, this is what we see. How can it be?
If we look closer to the plan we will see a special method there, a HASH JOIN SEMI
. This is a such called semi-join. It is designed just for the queries like this, using IN
or EXISTS
with an equality condition on the correlated field.
This method works as follows:
- It builds a hash table on the outer query (the one that is probed). I repeat: it is the outer query that is hashed, not the inner one. Everything that needs to be returned in
SELECT
clause goes to the hash table - Then it takes every row from the inner table and searches the hash table for the value of the column being tested
- If the value is found, the engine returns all records from the hash table that match this value and deletes them from the hash table. Unlike the plain
JOIN
, the hash table is dynamic: it is populated before theJOIN
begins and each row is removed from it as its gets returned - When either the hash table or the subquery is out if records, the query completes
Instead of removing the duplicates from the subquery before the iteration begins, the engine removes the values from the hash table as long as the iteration goes. This cannot be easily rewritten in SQL since SQL operates on immutable sets. However, outside of the set theory this can be implemented easily.
The duplicates in the subquery are not a problem anymore. Imagine we have value 42 three times in the outer table and five times in the inner table. The hash table will be build and all three outer query records will get into this hash table.
Now, the first value of 42 is fetched out of the subquery. This will result in a hash hit, all three records sharing this value will be found, returned and, which is most important, deleted.
What happens when the second value of 42 will be fetched? Well, nothing will happen. There are no more records in the hash table, they were deleted as they were returned. That's why this duplicate will just result in a hash miss and go away. Same is true for all other duplicates.
Now we can see what happened in our query. The engine build the hash table over the 10 values from t_outer. Then it took the first value from t_inner
(which happened to be a 1), found it in the hash table and returned the hashed record from t_outer
. The second value of 1 fetched from the subquery resulted in a hash miss and was promptly forgotten. In 20 records the hash table was empty. And when the hash table is empty, there is no point in running the query further: nothing could possibly be returned anymore.
Of course, there was a little cheating here: first 10 numbers just happened to be fetched out in first 10 records. If we would search for values from 499,990 to 500,000, the query would run for the same 2 seconds. However, in average, the hash table would be emptied much faster than the DISTINCT
set could be built to use with a JOIN
.
EXISTS
Oracle treats EXISTS
predicate with an equality condition for a correlated value exactly the same as an IN
predicate. Let's see it:
Exactly same plans, results and performance.
Summary
IN predicate is an ugly duckling of SQL. Due to the fact it appears inefficient in the eye of an inexperienced developer, it is generally considered non-efficient and EXISTS
or a JOIN
are advised to use instead.
However, Oracle optimizes IN
fairly well, using indexes, UNIQUE
constraints, NOT NULL
modifiers and other extra information wisely.
Even in case of duplicates and absence of an index on the subquery field, Oracle is able to use HASH SEMI JOIN
method which is more efficient than a JOIN
/ DISTINCT
solution.
EXISTS
is optimized exactly the same, however, IN
is more readable and concise.
IN predicate is exactly what should be use to search records against a list of values in Oracle.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
In the ‘IN on a non-indexed column’ explanation, can you explain how the correct results are returned? You say when a match is found, that record is deleted from the hash table. Does the deleted entry then get moved to the result set? Because it seems like what you are left with in the hash table is all the non matches, which would be the opposite of what you want.
Dave
11 Oct 12 at 20:47
@Dave: yes, exactly.
A record is returned into resultset and deleted from the hash table. What is left is the set of non-matches which is discarded (but would be returned as well if we had a FULL JOIN).
What is important is that the query stops as soon as the hash table is empty, which is why it only takes 1 ms (despite the inner table having 1M records and a FULL SCAN).
Quassnoi
11 Oct 12 at 20:52
Thanks for the detailed comparison of IN, Exists, and Join. I plan to run some of my own tests where the matching values are distributed in the big table. I’ll need proof for when I tell my development team that IN in Oracle is actually very efficient. :)
Quentin
3 Apr 13 at 00:35
Nice example for IN and Exists.
Jitu Singapore
26 Aug 14 at 17:10
Good explanation of In/Join/Exists
RV
28 Dec 15 at 03:18
Thanks! Very helpful.
Danny
27 Jul 16 at 14:41
Very well explained. Thank you so much for all your effort in explaining in detail.
Mahesh
11 Dec 16 at 10:07
Interesting article and explanations.
Still I not entirely convinced about the EXISTS. I read somewhere that althought the execution plan and table access are pretty much alike it can in fact be faster when executing as it could probably avoid considerable CPU.
I don’t recall where I read that, but one article stated that the EXISTS clause will be true and return the outer record on the very first matching occurence with the inner table, thus avoiding comparing all the subsequent records.
This is kind of logical, but unfortunately I found no other evidence of such claim.
Leandro
12 Apr 17 at 17:53
@Leandro: that here is the difference between this blog and “one article”. On this blog you can find the actual schema setup, the actual query, the actual plan and the actual performance. If have some doubts you can just copy-paste the setup to your own system and see for yourself.
Regarding your claim about EXISTS: did you read the blog at all? A good half of it explains the algorithm (the same algorithm used by both IN and EXISTS) and how is it better than a naive approach assumed by many people (including, obviously, the author or “one article”).
Quassnoi
13 Apr 17 at 13:59
The simple rule I will follow; If I need to check whether I have any matches between a table “bar” and a table “asdf”, but only really care about the results from table “bar”, I will be using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an(INNER) JOIN.
Faisal
15 Oct 17 at 16:49
We have used IN with bind variables from .net code. How efficient will is it to do the following?
where Id in (:Id1, :Id2…:Id1000) or Id in (:Id1001, :Id1002…) …
Some have argued that it would be better to instert the id’s in temporary tables and join against that table instead of using In(…) or In(..) …
What do you think?
Kjetil
27 Sep 18 at 16:37