EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Counting missing rows: Oracle

This is the 4th of 5 articles covering implementation of NOT IN predicate in several RDBMS‘es:

Let’s see how Oracle processes this predicate.

Let’s create the sample tables:

Table creation details

As in the previous articles, we have two tables: table1 with 1,000,000 rows and table2 with 1,000 rows.

The original query:

SELECT  COUNT(*)
FROM    "20090421_in".table1 t1
WHERE   id NOT IN
        (
        SELECT	id
        FROM	"20090421_in".table2 t2
        )
cnt
999000
1 row selected in 0,249 seconds

runs for 249 ms and produces the following execution plan:

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  HASH JOIN RIGHT ANTI
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID

Oracle employs an ANTI JOIN algorithm here, but instead of MERGE ANTI JOIN used by SQL Server, Oracle‘s optimizer chooses a HASH ANTI JOIN algorithm.

This algorithm performs the following steps:

  1. Build a hash table over the values from the smaller dataset
  2. Use this hash table to look up the values from the larger dataset

As we can see, this algorithm is quite efficient, as it uses a special access path called INDEX FAST FULL SCAN to build the rowsets.

This access path scans the index pages in order they reside in the datafiles, ignoring the index order. This is faster than traversing the index tree, uses sequential disk I/O and can be parallelized.

Since all the values we need for the join are contained in the index itself and HASH JOIN (unlike MERGE JOIN) doesn’t care about the order of values, this access path is exactly what we need.

If we look for the missing grouper‘s:

SELECT  COUNT(*)
FROM    "20090421_in".table1 t1
WHERE   grouper NOT IN
        (
        SELECT	id
        FROM	"20090421_in".table2 t2
        )
cnt
1000
1 row selected in 0,269 seconds

, we get almost same execution time of 269 ms and the following plan:

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  HASH JOIN RIGHT ANTI
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID
   INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER

, which is the same as the previous one.

Now, let’s try the tricks we used to optimize this query in MySQL

Joining on DISTINCT:

SELECT  COUNT(*)
FROM    (
        SELECT	DISTINCT grouper
        FROM	"20090421_in".table1
        ) d
JOIN    "20090421_in".table1 t1
ON      t1.grouper = d.grouper
WHERE   d.grouper NOT IN
        (
        SELECT	id
        FROM	"20090421_in".table2 t2
        )
cnt
1000
1 row selected in 0,259 seconds
SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  NESTED LOOPS
   NESTED LOOPS ANTI
    VIEW
     HASH UNIQUE
      INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER
    INDEX UNIQUE SCAN, 20090421_in, PK_TABLE2_ID
   INDEX RANGE SCAN, 20090421_in, IX_TABLE1_GROUPER

This query uses HASH UNIQUE algorithm for seleting DISTINCT grouper‘s from table1, which implies building a hash table over the values, discarding the duplicates.

The resultset is then ANTI JOIN‘ed with table2 to produce the missing grouper‘s, and, finally, the missing grouper‘s are looked up in table1 using INDEX RANGE SCAN and counted.

As hashing the large rowset of 1,000,000 rows in both queries is the longest operation, this query is of same efficiency as the original query.

Now, let’s try reverse COUNT(*):

SELECT  (
        SELECT  COUNT(*)
        FROM    "20090421_in".table1 t1
        ) -
        (
        SELECT  COUNT(*)
        FROM    "20090421_in".table1 t1
        WHERE   id IN
                (
                SELECT  id
                FROM    "20090421_in".table2 t2
                )
        )
FROM    dual
cnt
999000
1 row selected in 0,065 seconds

As with MySQL and SQL Server, this trick greatly improves the query (it runs for 65 ms, four times as fast).

The execution plan says:

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID
 SORT AGGREGATE
  NESTED LOOPS
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID
   INDEX UNIQUE SCAN, 20090421_in, PK_TABLE1_ID
 FAST DUAL

By changing the join condition, we made possible for table2 to be leading in the join.

This plan uses NESTED LOOPS with INDEX UNIQUE SCAN on table1, which eliminates the need in costly lookups over 1,000,000 rows. The engine still needs to COUNT(*) the rows, but it’s much faster than looking for actual values.

Summary: for NOT IN predicate in the tables above, Oracle uses HASH ANTI JOIN algorithm, which implies building a hash table from the results of the inner subquery.

This algorithm benefits from using INDEX FAST FULL SCAN, since no particular order is required for building and searching the hash table.

The query cannot be improved much by joining on DISTINCT grouper‘s, since Oracle still needs to use the large resultset for the cache lookups, but benefits from reversing the predicate and subtracting the result from the total COUNT(*).

To be continued.

Written by Quassnoi

April 21st, 2009 at 11:00 pm

Posted in Oracle

Comments are closed.