IN vs. JOIN vs. EXISTS: Oracle
Answering questions asked on the site.
Jason asks:
I have a co-worker who swears that Oracle
INqueries are slow and refuses to use them. For example:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.
So, in general is an
INquery very expensive? I'm having trouble putting together many non-trivial queries to run onEXPLAIN PLAN.
Does the IN predicate always have inferior efficiency compared to it's counterparts, EXISTS and JOIN?
Let's check.
First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:
INpredicate:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )
EXISTSpredicate:SELECT foo FROM bar WHERE EXISTS ( SELECT NULL FROM asdf WHERE asdf.stuff = bar.stuff )
JOIN/DISTINCT:SELECT b.foo FROM ( SELECT DISTINCT stuff FROM asdf ) a JOIN bar b ON b.stuff = a.stuff
All these queries are semantically the same.
Common wisdom advices against using IN predicates because it just doesn't look a right thing to do.
This is because less experienced SQL developers tend to translate the SQL statements into pseudocode just like they see it, which in this case looks something like this:
foreach ($bar as $bar_record) {
foreach ($asdf as $asdf_record) {
if ($bar_record->stuff == $asdf_record->stuff)
output ($bar_record);
break;
}
}
This just radiates inefficiency, since the inner rowset should be iterated for each row returned.
EXISTS looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.
Finally, the JOIN looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an IN? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!
People, come on. Oracle developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an IN construct.
Now, let's make two sample tables and see how it's done:
Table creation details
Table t_outer contains 10 records.
Table t_inner contains 1,000,000 records. The field uvalue has a UNIQUE index defined on it, while the field pvalue has no index at all, and contains duplicates (each value from 1 to 500,000 is stored twice).
Now, let's make a couple of queries.
IN on UNIQUE column
SELECT id, value FROM t_outer WHERE value IN ( SELECT uvalue FROM t_inner )
This query returns all 10 values from the t_outer instantly.
If we look into the query plan we will see that this is just a plain NESTED LOOPS join on the index. No whole subquery reevaluation, the index is used and used efficiently.
Oracle is smart enough to make three logical constructs:
INis equivalent to aJOIN/DISTINCTDISTINCTon a column marked asUNIQUEandNOT NULLis redundant, so theINis equivalent to a simpleJOININis equivalent to a simpleJOINso any valid join method and the access methods can be used
This makes this query super fast.
JOIN / DISTINCT on UNIQUE column
SELECT id, value FROM t_outer o JOIN ( SELECT DISTINCT uvalue FROM t_inner ) ON o.value = uvalue
Exactly same query plan. From Oracle's point of view, these queries are identical. Oracle just sees that they are the same and can avoid the whole query evaluation at all: it just uses index access.
Now let's see how smart Oracle will be when it will have to search for a non-indexed column which, in addition, contains duplicates.
JOIN / DISTINCT on a non-indexed column
SELECT id, value FROM t_outer o JOIN ( SELECT DISTINCT pvalue FROM t_inner ) ON o.value = pvalue
Now the query takes quite a significant time, more than 2 seconds.
Oracle decided to make the join first and then get rid of the duplicates, despite the fact that the DISTINCT clause is inside the inline view, not outside. However, semantically this is the same and Oracle knows it.
Now, how will the IN predicate work? Will it be so smart to convert the query to the JOIN which completes in 2 seconds or it will be running the inner query over and over and over again?
Let's see.
IN on a non-indexed column
SELECT id, value FROM t_outer WHERE value IN ( SELECT pvalue FROM t_inner )
1 millisecond.
The plan says TABLE ACCESS FULL on both tables, and the t_inner is very large, there are 1,000,000 rows there and they are long rows.
It just cannot be scanned in 1 ms. However, this is what we see. How can it be?
If we look closer to the plan we will see a special method there, a HASH JOIN SEMI. This is a such called semi-join. It is designed just for the queries like this, using IN or EXISTS with an equality condition on the correlated field.
This method works as follows:
- It builds a hash table on the outer query (the one that is probed). I repeat: it is the outer query that is hashed, not the inner one. Everything that needs to be returned in
SELECTclause goes to the hash table - Then it takes every row from the inner table and searches the hash table for the value of the column being tested
- If the value is found, the engine returns all records from the hash table that match this value and deletes them from the hash table. Unlike the plain
JOIN, the hash table is dynamic: it is populated before theJOINbegins and each row is removed from it as its gets returned - When either the hash table or the subquery is out if records, the query completes
Instead of removing the duplicates from the subquery before the iteration begins, the engine removes the values from the hash table as long as the iteration goes. This cannot be easily rewritten in SQL since SQL operates on immutable sets. However, outside of the set theory this can be implemented easily.
The duplicates in the subquery are not a problem anymore. Imagine we have value 42 three times in the outer table and five times in the inner table. The hash table will be build and all three outer query records will get into this hash table.
Now, the first value of 42 is fetched out of the subquery. This will result in a hash hit, all three records sharing this value will be found, returned and, which is most important, deleted.
What happens when the second value of 42 will be fetched? Well, nothing will happen. There are no more records in the hash table, they were deleted as they were returned. That's why this duplicate will just result in a hash miss and go away. Same is true for all other duplicates.
Now we can see what happened in our query. The engine build the hash table over the 10 values from t_outer. Then it took the first value from t_inner (which happened to be a 1), found it in the hash table and returned the hashed record from t_outer. The second value of 1 fetched from the subquery resulted in a hash miss and was promptly forgotten. In 20 records the hash table was empty. And when the hash table is empty, there is no point in running the query further: nothing could possibly be returned anymore.
Of course, there was a little cheating here: first 10 numbers just happened to be fetched out in first 10 records. If we would search for values from 499,990 to 500,000, the query would run for the same 2 seconds. However, in average, the hash table would be emptied much faster than the DISTINCT set could be built to use with a JOIN.
EXISTS
Oracle treats EXISTS predicate with an equality condition for a correlated value exactly the same as an IN predicate. Let's see it:
Exactly same plans, results and performance.
Summary
IN predicate is an ugly duckling of SQL. Due to the fact it appears inefficient in the eye of an inexperienced developer, it is generally considered non-efficient and EXISTS or a JOIN are advised to use instead.
However, Oracle optimizes IN fairly well, using indexes, UNIQUE constraints, NOT NULL modifiers and other extra information wisely.
Even in case of duplicates and absence of an index on the subquery field, Oracle is able to use HASH SEMI JOIN method which is more efficient than a JOIN / DISTINCT solution.
EXISTS is optimized exactly the same, however, IN is more readable and concise.
IN predicate is exactly what should be use to search records against a list of values in Oracle.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Subscribe in a reader
In the ‘IN on a non-indexed column’ explanation, can you explain how the correct results are returned? You say when a match is found, that record is deleted from the hash table. Does the deleted entry then get moved to the result set? Because it seems like what you are left with in the hash table is all the non matches, which would be the opposite of what you want.
Dave
11 Oct 12 at 20:47
@Dave: yes, exactly.
A record is returned into resultset and deleted from the hash table. What is left is the set of non-matches which is discarded (but would be returned as well if we had a FULL JOIN).
What is important is that the query stops as soon as the hash table is empty, which is why it only takes 1 ms (despite the inner table having 1M records and a FULL SCAN).
Quassnoi
11 Oct 12 at 20:52
Thanks for the detailed comparison of IN, Exists, and Join. I plan to run some of my own tests where the matching values are distributed in the big table. I’ll need proof for when I tell my development team that IN in Oracle is actually very efficient. :)
Quentin
3 Apr 13 at 00:35
Nice example for IN and Exists.
Jitu Singapore
26 Aug 14 at 17:10
Good explanation of In/Join/Exists
RV
28 Dec 15 at 03:18
Thanks! Very helpful.
Danny
27 Jul 16 at 14:41
Very well explained. Thank you so much for all your effort in explaining in detail.
Mahesh
11 Dec 16 at 10:07
Interesting article and explanations.
Still I not entirely convinced about the EXISTS. I read somewhere that althought the execution plan and table access are pretty much alike it can in fact be faster when executing as it could probably avoid considerable CPU.
I don’t recall where I read that, but one article stated that the EXISTS clause will be true and return the outer record on the very first matching occurence with the inner table, thus avoiding comparing all the subsequent records.
This is kind of logical, but unfortunately I found no other evidence of such claim.
Leandro
12 Apr 17 at 17:53
@Leandro: that here is the difference between this blog and “one article”. On this blog you can find the actual schema setup, the actual query, the actual plan and the actual performance. If have some doubts you can just copy-paste the setup to your own system and see for yourself.
Regarding your claim about EXISTS: did you read the blog at all? A good half of it explains the algorithm (the same algorithm used by both IN and EXISTS) and how is it better than a naive approach assumed by many people (including, obviously, the author or “one article”).
Quassnoi
13 Apr 17 at 13:59
The simple rule I will follow; If I need to check whether I have any matches between a table “bar” and a table “asdf”, but only really care about the results from table “bar”, I will be using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an(INNER) JOIN.
Faisal
15 Oct 17 at 16:49
We have used IN with bind variables from .net code. How efficient will is it to do the following?
where Id in (:Id1, :Id2…:Id1000) or Id in (:Id1001, :Id1002…) …
Some have argued that it would be better to instert the id’s in temporary tables and join against that table instead of using In(…) or In(..) …
What do you think?
Kjetil
27 Sep 18 at 16:37