IN vs. JOIN vs. EXISTS: Oracle at EXPLAIN EXTENDED

IN vs. JOIN vs. EXISTS: Oracle

Answering questions asked on the site.

Jason asks:

I have a co-worker who swears that Oracle IN queries are slow and refuses to use them. For example:
SELECT  foo
FROM    bar
WHERE   bar.stuff IN
(
SELECT  stuff
FROM    asdf
)
Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.

So, in general is an IN query very expensive? I'm having trouble putting together many non-trivial queries to run on EXPLAIN PLAN.

Does the IN predicate always have inferior efficiency compared to it's counterparts, EXISTS and JOIN?

Let's check.

First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:

IN predicate:

SELECT  foo
FROM    bar
WHERE   bar.stuff IN
(
SELECT  stuff
FROM    asdf
)

EXISTS predicate:

SELECT  foo
FROM    bar
WHERE   EXISTS
(
SELECT  NULL
FROM    asdf
WHERE   asdf.stuff = bar.stuff
)

JOIN / DISTINCT:

SELECT  b.foo
FROM    (
SELECT  DISTINCT stuff
FROM    asdf
) a
JOIN    bar b
ON      b.stuff = a.stuff

All these queries are semantically the same.

Common wisdom advices against using IN predicates because it just doesn't look a right thing to do.

This is because less experienced SQL developers tend to translate the SQL statements into pseudocode just like they see it, which in this case looks something like this:

foreach ($bar as $bar_record) {
foreach ($asdf as $asdf_record) {
if ($bar_record-&gt;stuff == $asdf_record-&gt;stuff)
output ($bar_record);
break;
}
}

This just radiates inefficiency, since the inner rowset should be iterated for each row returned.

EXISTS looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.

Finally, the JOIN looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an IN? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!

People, come on. Oracle developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an IN construct.

Now, let's make two sample tables and see how it's done:

Table creation details

CREATE TABLE t_outer (id NOT NULL, value NOT NULL, stuffing NOT NULL)
AS
SELECT  level, level, CAST(RPAD('*', 100, '*') AS VARCHAR2(100))
FROM    dual
CONNECT BY
level &lt;= 10
/

CREATE TABLE t_inner (id NOT NULL, pvalue NOT NULL, uvalue NOT NULL, stuffing NOT NULL)
AS
SELECT  level, FLOOR((level - 1) / 2) + 1, level, CAST(RPAD('*', 100, '*') AS VARCHAR2(100))
FROM    dual
CONNECT BY
        level &lt;= 1000000
/

ALTER TABLE t_outer ADD CONSTRAINT pk_outer_id PRIMARY KEY (id)
/
ALTER TABLE t_inner ADD CONSTRAINT pk_inner_id PRIMARY KEY (id)
/
CREATE UNIQUE INDEX ux_inner_uvalue ON t_inner (uvalue)
/

Table t_outer contains 10 records.

Table t_inner contains 1,000,000 records. The field uvalue has a UNIQUE index defined on it, while the field pvalue has no index at all, and contains duplicates (each value from 1 to 500,000 is stored twice).

Now, let's make a couple of queries.

IN on UNIQUE column

SELECT  id, value
FROM    t_outer
WHERE   value IN
(
SELECT  uvalue
FROM    t_inner
)

View query details

ID	VALUE
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
10 rows fetched in 0.0003s (0.0007s)

SELECT STATEMENT 
 NESTED LOOPS 
  TABLE ACCESS FULL, 20090930_in.T_OUTER
  INDEX UNIQUE SCAN, 20090930_in.UX_INNER_UVALUE

This query returns all 10 values from the t_outer instantly.

If we look into the query plan we will see that this is just a plain NESTED LOOPS join on the index. No whole subquery reevaluation, the index is used and used efficiently.

Oracle is smart enough to make three logical constructs:

IN is equivalent to a JOIN / DISTINCT
DISTINCT on a column marked as UNIQUE and NOT NULL is redundant, so the IN is equivalent to a simple JOIN
IN is equivalent to a simple JOIN so any valid join method and the access methods can be used

This makes this query super fast.

JOIN / DISTINCT on UNIQUE column

SELECT  id, value
FROM    t_outer o
JOIN    (
SELECT  DISTINCT uvalue
FROM    t_inner
)
ON      o.value = uvalue

View query details

ID	VALUE
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
10 rows fetched in 0.0003s (0.0007s)

SELECT STATEMENT 
 VIEW 
  NESTED LOOPS 
   TABLE ACCESS FULL, 20090930_in.T_OUTER
   INDEX UNIQUE SCAN, 20090930_in.UX_INNER_UVALUE

Exactly same query plan. From Oracle's point of view, these queries are identical. Oracle just sees that they are the same and can avoid the whole query evaluation at all: it just uses index access.

Now let's see how smart Oracle will be when it will have to search for a non-indexed column which, in addition, contains duplicates.

JOIN / DISTINCT on a non-indexed column

SELECT  id, value
FROM    t_outer o
JOIN    (
SELECT  DISTINCT pvalue
FROM    t_inner
)
ON      o.value = pvalue

View query details

ID	VALUE
5	5
4	4
9	9
10	10
8	8
6	6
3	3
1	1
2	2
7	7
10 rows fetched in 0.0003s (2.2187s)

SELECT STATEMENT 
 VIEW 
  HASH UNIQUE
   HASH JOIN 
    TABLE ACCESS FULL, 20090930_in.T_OUTER
    TABLE ACCESS FULL, 20090930_in.T_INNER

Now the query takes quite a significant time, more than 2 seconds.

Oracle decided to make the join first and then get rid of the duplicates, despite the fact that the DISTINCT clause is inside the inline view, not outside. However, semantically this is the same and Oracle knows it.

Now, how will the IN predicate work? Will it be so smart to convert the query to the JOIN which completes in 2 seconds or it will be running the inner query over and over and over again?

Let's see.

IN on a non-indexed column

SELECT  id, value
FROM    t_outer
WHERE   value IN
(
SELECT  pvalue
FROM    t_inner
)

View query details

ID	VALUE
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
10 rows fetched in 0.0003s (0.0016s)

SELECT STATEMENT 
 HASH JOIN SEMI
  TABLE ACCESS FULL, 20090930_in.T_OUTER
  TABLE ACCESS FULL, 20090930_in.T_INNER

1 millisecond.

The plan says TABLE ACCESS FULL on both tables, and the t_inner is very large, there are 1,000,000 rows there and they are long rows.

It just cannot be scanned in 1 ms. However, this is what we see. How can it be?

If we look closer to the plan we will see a special method there, a HASH JOIN SEMI. This is a such called semi-join. It is designed just for the queries like this, using IN or EXISTS with an equality condition on the correlated field.

This method works as follows:

It builds a hash table on the outer query (the one that is probed). I repeat: it is the outer query that is hashed, not the inner one. Everything that needs to be returned in SELECT clause goes to the hash table
Then it takes every row from the inner table and searches the hash table for the value of the column being tested
If the value is found, the engine returns all records from the hash table that match this value and deletes them from the hash table. Unlike the plain JOIN, the hash table is dynamic: it is populated before the JOIN begins and each row is removed from it as its gets returned
When either the hash table or the subquery is out if records, the query completes

Instead of removing the duplicates from the subquery before the iteration begins, the engine removes the values from the hash table as long as the iteration goes. This cannot be easily rewritten in SQL since SQL operates on immutable sets. However, outside of the set theory this can be implemented easily.

The duplicates in the subquery are not a problem anymore. Imagine we have value 42 three times in the outer table and five times in the inner table. The hash table will be build and all three outer query records will get into this hash table.

Now, the first value of 42 is fetched out of the subquery. This will result in a hash hit, all three records sharing this value will be found, returned and, which is most important, deleted.

What happens when the second value of 42 will be fetched? Well, nothing will happen. There are no more records in the hash table, they were deleted as they were returned. That's why this duplicate will just result in a hash miss and go away. Same is true for all other duplicates.

Now we can see what happened in our query. The engine build the hash table over the 10 values from t_outer. Then it took the first value from t_inner (which happened to be a 1), found it in the hash table and returned the hashed record from t_outer. The second value of 1 fetched from the subquery resulted in a hash miss and was promptly forgotten. In 20 records the hash table was empty. And when the hash table is empty, there is no point in running the query further: nothing could possibly be returned anymore.

Of course, there was a little cheating here: first 10 numbers just happened to be fetched out in first 10 records. If we would search for values from 499,990 to 500,000, the query would run for the same 2 seconds. However, in average, the hash table would be emptied much faster than the DISTINCT set could be built to use with a JOIN.

EXISTS

Oracle treats EXISTS predicate with an equality condition for a correlated value exactly the same as an IN predicate. Let's see it:

EXISTS on unique value

SELECT  id, value
FROM    t_outer
WHERE   EXISTS
(
SELECT  NULL
FROM    t_inner
WHERE   uvalue = value
)

ID	VALUE
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
10 rows fetched in 0.0003s (0.0089s)

SELECT STATEMENT 
 NESTED LOOPS SEMI
  TABLE ACCESS FULL, 20090930_in.T_OUTER
  INDEX UNIQUE SCAN, 20090930_in.UX_INNER_UVALUE

EXISTS on a non-unique value

SELECT  id, value
FROM    t_outer
WHERE   EXISTS
(
SELECT  NULL
FROM    t_inner
WHERE   pvalue = value
)

ID	VALUE
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
10 rows fetched in 0.0003s (0.0016s)

SELECT STATEMENT 
 HASH JOIN SEMI
  TABLE ACCESS FULL, 20090930_in.T_OUTER
  TABLE ACCESS FULL, 20090930_in.T_INNER

Exactly same plans, results and performance.

Summary

IN predicate is an ugly duckling of SQL. Due to the fact it appears inefficient in the eye of an inexperienced developer, it is generally considered non-efficient and EXISTS or a JOIN are advised to use instead.

However, Oracle optimizes IN fairly well, using indexes, UNIQUE constraints, NOT NULL modifiers and other extra information wisely.

Even in case of duplicates and absence of an index on the subquery field, Oracle is able to use HASH SEMI JOIN method which is more efficient than a JOIN / DISTINCT solution.

EXISTS is optimized exactly the same, however, IN is more readable and concise.

IN predicate is exactly what should be use to search records against a list of values in Oracle.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

September 30th, 2009 at 11:00 pm

Posted in Oracle

11 Responses to 'IN vs. JOIN vs. EXISTS: Oracle'

Subscribe to comments with RSS

In the ‘IN on a non-indexed column’ explanation, can you explain how the correct results are returned? You say when a match is found, that record is deleted from the hash table. Does the deleted entry then get moved to the result set? Because it seems like what you are left with in the hash table is all the non matches, which would be the opposite of what you want.

Dave

11 Oct 12 at 20:47
@Dave: yes, exactly.

A record is returned into resultset and deleted from the hash table. What is left is the set of non-matches which is discarded (but would be returned as well if we had a FULL JOIN).

What is important is that the query stops as soon as the hash table is empty, which is why it only takes 1 ms (despite the inner table having 1M records and a FULL SCAN).

Quassnoi

11 Oct 12 at 20:52
Thanks for the detailed comparison of IN, Exists, and Join. I plan to run some of my own tests where the matching values are distributed in the big table. I’ll need proof for when I tell my development team that IN in Oracle is actually very efficient. :)

Quentin

3 Apr 13 at 00:35
Nice example for IN and Exists.

Jitu Singapore

26 Aug 14 at 17:10
Good explanation of In/Join/Exists

RV

28 Dec 15 at 03:18
Thanks! Very helpful.

Danny

27 Jul 16 at 14:41
Very well explained. Thank you so much for all your effort in explaining in detail.

Mahesh

11 Dec 16 at 10:07
Interesting article and explanations.
Still I not entirely convinced about the EXISTS. I read somewhere that althought the execution plan and table access are pretty much alike it can in fact be faster when executing as it could probably avoid considerable CPU.
I don’t recall where I read that, but one article stated that the EXISTS clause will be true and return the outer record on the very first matching occurence with the inner table, thus avoiding comparing all the subsequent records.
This is kind of logical, but unfortunately I found no other evidence of such claim.

Leandro

12 Apr 17 at 17:53
@Leandro: that here is the difference between this blog and “one article”. On this blog you can find the actual schema setup, the actual query, the actual plan and the actual performance. If have some doubts you can just copy-paste the setup to your own system and see for yourself.

Regarding your claim about EXISTS: did you read the blog at all? A good half of it explains the algorithm (the same algorithm used by both IN and EXISTS) and how is it better than a naive approach assumed by many people (including, obviously, the author or “one article”).

Quassnoi

13 Apr 17 at 13:59
The simple rule I will follow; If I need to check whether I have any matches between a table “bar” and a table “asdf”, but only really care about the results from table “bar”, I will be using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an(INNER) JOIN.

Faisal

15 Oct 17 at 16:49
We have used IN with bind variables from .net code. How efficient will is it to do the following?

where Id in (:Id1, :Id2…:Id1000) or Id in (:Id1001, :Id1002…) …

Some have argued that it would be better to instert the id’s in temporary tables and join against that table instead of using In(…) or In(..) …

What do you think?

Kjetil

27 Sep 18 at 16:37

EXPLAIN EXTENDED