EXPLAIN EXTENDED

How to create fast database queries

IN vs. JOIN vs. EXISTS: Oracle

with 3 comments

Answering questions asked on the site.

Jason asks:

I have a co-worker who swears that Oracle IN queries are slow and refuses to use them. For example:

SELECT  foo
FROM    bar
WHERE   bar.stuff IN
        (
        SELECT  stuff
        FROM    asdf
        )

Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.

So, in general is an IN query very expensive? I'm having trouble putting together many non-trivial queries to run on EXPLAIN PLAN.

Does the IN predicate always have inferior efficiency compared to it's counterparts, EXISTS and JOIN?

Let's check.

First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:

  • IN predicate:

    SELECT  foo
    FROM    bar
    WHERE   bar.stuff IN
            (
            SELECT  stuff
            FROM    asdf
            )
    
  • EXISTS predicate:

    SELECT  foo
    FROM    bar
    WHERE   EXISTS
            (
            SELECT  NULL
            FROM    asdf 
            WHERE   asdf.stuff = bar.stuff
            )
    
  • JOIN / DISTINCT:

    SELECT  b.foo
    FROM    (
            SELECT  DISTINCT stuff
            FROM    asdf
            ) a
    JOIN    bar b
    ON      b.stuff = a.stuff
    

All these queries are semantically the same.

Common wisdom advices against using IN predicates because it just doesn't look a right thing to do.

This is because less experienced SQL developers tend to translate the SQL statements into pseudocode just like they see it, which in this case looks something like this:

foreach ($bar as $bar_record) {
    foreach ($asdf as $asdf_record) {
        if ($bar_record->stuff == $asdf_record->stuff)
            output ($bar_record);
            break;
    }
}

This just radiates inefficiency, since the inner rowset should be iterated for each row returned.

EXISTS looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.

Finally, the JOIN looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an IN? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!

People, come on. Oracle developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an IN construct.

Now, let's make two sample tables and see how it's done:

Table creation details

Table t_outer contains 10 records.

Table t_inner contains 1,000,000 records. The field uvalue has a UNIQUE index defined on it, while the field pvalue has no index at all, and contains duplicates (each value from 1 to 500,000 is stored twice).

Now, let's make a couple of queries.

IN on UNIQUE column

SELECT  id, value
FROM    t_outer
WHERE   value IN
        (
        SELECT  uvalue
        FROM    t_inner
        )

View query details

This query returns all 10 values from the t_outer instantly.

If we look into the query plan we will see that this is just a plain NESTED LOOPS join on the index. No whole subquery reevaluation, the index is used and used efficiently.

Oracle is smart enough to make three logical constructs:

  1. IN is equivalent to a JOIN / DISTINCT
  2. DISTINCT on a column marked as UNIQUE and NOT NULL is redundant, so the IN is equivalent to a simple JOIN
  3. IN is equivalent to a simple JOIN so any valid join method and the access methods can be used

This makes this query super fast.

JOIN / DISTINCT on UNIQUE column

SELECT  id, value
FROM    t_outer o
JOIN    (
        SELECT  DISTINCT uvalue
        FROM    t_inner
        )
ON      o.value = uvalue

View query details

Exactly same query plan. From Oracle's point of view, these queries are identical. Oracle just sees that they are the same and can avoid the whole query evaluation at all: it just uses index access.

Now let's see how smart Oracle will be when it will have to search for a non-indexed column which, in addition, contains duplicates.

JOIN / DISTINCT on a non-indexed column

SELECT  id, value
FROM    t_outer o
JOIN    (
        SELECT  DISTINCT pvalue
        FROM    t_inner
        )
ON      o.value = pvalue

View query details

Now the query takes quite a significant time, more than 2 seconds.

Oracle decided to make the join first and then get rid of the duplicates, despite the fact that the DISTINCT clause is inside the inline view, not outside. However, semantically this is the same and Oracle knows it.

Now, how will the IN predicate work? Will it be so smart to convert the query to the JOIN which completes in 2 seconds or it will be running the inner query over and over and over again?

Let's see.

IN on a non-indexed column

SELECT  id, value
FROM    t_outer
WHERE   value IN
        (
        SELECT  pvalue
        FROM    t_inner
        )

View query details

1 millisecond.

The plan says TABLE ACCESS FULL on both tables, and the t_inner is very large, there are 1,000,000 rows there and they are long rows.

It just cannot be scanned in 1 ms. However, this is what we see. How can it be?

If we look closer to the plan we will see a special method there, a HASH JOIN SEMI. This is a such called semi-join. It is designed just for the queries like this, using IN or EXISTS with an equality condition on the correlated field.

This method works as follows:

  • It builds a hash table on the outer query (the one that is probed). I repeat: it is the outer query that is hashed, not the inner one. Everything that needs to be returned in SELECT clause goes to the hash table
  • Then it takes every row from the inner table and searches the hash table for the value of the column being tested
  • If the value is found, the engine returns all records from the hash table that match this value and deletes them from the hash table. Unlike the plain JOIN, the hash table is dynamic: it is populated before the JOIN begins and each row is removed from it as its gets returned
  • When either the hash table or the subquery is out if records, the query completes

Instead of removing the duplicates from the subquery before the iteration begins, the engine removes the values from the hash table as long as the iteration goes. This cannot be easily rewritten in SQL since SQL operates on immutable sets. However, outside of the set theory this can be implemented easily.

The duplicates in the subquery are not a problem anymore. Imagine we have value 42 three times in the outer table and five times in the inner table. The hash table will be build and all three outer query records will get into this hash table.

Now, the first value of 42 is fetched out of the subquery. This will result in a hash hit, all three records sharing this value will be found, returned and, which is most important, deleted.

What happens when the second value of 42 will be fetched? Well, nothing will happen. There are no more records in the hash table, they were deleted as they were returned. That's why this duplicate will just result in a hash miss and go away. Same is true for all other duplicates.

Now we can see what happened in our query. The engine build the hash table over the 10 values from t_outer. Then it took the first value from t_inner (which happened to be a 1), found it in the hash table and returned the hashed record from t_outer. The second value of 1 fetched from the subquery resulted in a hash miss and was promptly forgotten. In 20 records the hash table was empty. And when the hash table is empty, there is no point in running the query further: nothing could possibly be returned anymore.

Of course, there was a little cheating here: first 10 numbers just happened to be fetched out in first 10 records. If we would search for values from 499,990 to 500,000, the query would run for the same 2 seconds. However, in average, the hash table would be emptied much faster than the DISTINCT set could be built to use with a JOIN.

EXISTS

Oracle treats EXISTS predicate with an equality condition for a correlated value exactly the same as an IN predicate. Let's see it:

EXISTS on unique value

EXISTS on a non-unique value

Exactly same plans, results and performance.

Summary

IN predicate is an ugly duckling of SQL. Due to the fact it appears inefficient in the eye of an inexperienced developer, it is generally considered non-efficient and EXISTS or a JOIN are advised to use instead.

However, Oracle optimizes IN fairly well, using indexes, UNIQUE constraints, NOT NULL modifiers and other extra information wisely.

Even in case of duplicates and absence of an index on the subquery field, Oracle is able to use HASH SEMI JOIN method which is more efficient than a JOIN / DISTINCT solution.

EXISTS is optimized exactly the same, however, IN is more readable and concise.

IN predicate is exactly what should be use to search records against a list of values in Oracle.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

September 30th, 2009 at 11:00 pm

Posted in Oracle

3 Responses to 'IN vs. JOIN vs. EXISTS: Oracle'

Subscribe to comments with RSS or TrackBack to 'IN vs. JOIN vs. EXISTS: Oracle'.

  1. In the ‘IN on a non-indexed column’ explanation, can you explain how the correct results are returned? You say when a match is found, that record is deleted from the hash table. Does the deleted entry then get moved to the result set? Because it seems like what you are left with in the hash table is all the non matches, which would be the opposite of what you want.

    Dave

    11 Oct 12 at 20:47

  2. @Dave: yes, exactly.

    A record is returned into resultset and deleted from the hash table. What is left is the set of non-matches which is discarded (but would be returned as well if we had a FULL JOIN).

    What is important is that the query stops as soon as the hash table is empty, which is why it only takes 1 ms (despite the inner table having 1M records and a FULL SCAN).

    Quassnoi

    11 Oct 12 at 20:52

  3. Thanks for the detailed comparison of IN, Exists, and Join. I plan to run some of my own tests where the matching values are distributed in the big table. I’ll need proof for when I tell my development team that IN in Oracle is actually very efficient. :)

    Quentin

    3 Apr 13 at 00:35

Leave a Reply