Oracle: how a NOT IN predicate works at EXPLAIN EXTENDED

Counting missing rows: Oracle

Comments enabled. I *really* need your comment

This is the 4^th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

NOT IN in MySQL
NOT IN in MySQL (using DISTINCT)
NOT IN in SQL Server
NOT IN in Oracle
NOT IN in PostgreSQL

Let's see how Oracle processes this predicate.

Let's create the sample tables:

Table creation details

CREATE USER "20090421_in" IDENTIFIED BY "20090421_in"
/
GRANT CREATE ANY TABLE TO "20090421_in"
/
GRANT UNLIMITED TABLESPACE TO "20090421_in"
/
ALTER SESSION SET CURRENT_SCHEMA="20090421_in"
/
CREATE TABLE table1 (
        id NUMBER(10) NOT NULL,
        grouper NUMBER(10) NOT NULL,
        value VARCHAR2(50)
)
/
ALTER TABLE table1
ADD CONSTRAINT pk_table1_id PRIMARY KEY (id)
/
CREATE INDEX ix_table1_grouper ON table1 (grouper)
/

CREATE TABLE table2 (
        id NUMBER(10) NOT NULL,
        value VARCHAR2(50)
)
/
ALTER TABLE table2
ADD CONSTRAINT pk_table2_id PRIMARY KEY (id)
/
INSERT
INTO    table1 (id, grouper, value)
SELECT  level, MOD(level, 1000), 'Value One ' || level
FROM    dual
CONNECT BY
        level <= 1000000
/
INSERT
INTO    table2 (id, value)
SELECT  level, 'Value Two ' || level
FROM    dual
CONNECT BY
        level <= 1000
/
COMMIT
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090421_in"');
END;
/
&#91;/sourcecode&#93;
</div>

As in the previous articles, we have two tables: <code>table1</code> with <strong>1,000,000</strong> rows and <code>table2</code> with <strong>1,000</strong> rows.

The original query:


SELECT  COUNT(*)
FROM    "20090421_in".table1 t1
WHERE   id NOT IN
        (
        SELECT	id
        FROM	"20090421_in".table2 t2
        )

cnt
999000
1 row selected in 0,249 seconds

runs for 249 ms and produces the following execution plan:

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  HASH JOIN RIGHT ANTI
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID

Oracle employs an ANTI JOIN algorithm here, but instead of MERGE ANTI JOIN used by SQL Server, Oracle's optimizer chooses a HASH ANTI JOIN algorithm.

This algorithm performs the following steps:

Build a hash table over the values from the smaller dataset
Use this hash table to look up the values from the larger dataset

As we can see, this algorithm is quite efficient, as it uses a special access path called INDEX FAST FULL SCAN to build the rowsets.

This access path scans the index pages in order they reside in the datafiles, ignoring the index order. This is faster than traversing the index tree, uses sequential disk I/O and can be parallelized.

Since all the values we need for the join are contained in the index itself and HASH JOIN (unlike MERGE JOIN) doesn't care about the order of values, this access path is exactly what we need.

If we look for the missing grouper's:

SELECT  COUNT(*)
FROM    "20090421_in".table1 t1
WHERE   grouper NOT IN
        (
        SELECT	id
        FROM	"20090421_in".table2 t2
        )

cnt
1000
1 row selected in 0,269 seconds

, we get almost same execution time of 269 ms and the following plan:

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  HASH JOIN RIGHT ANTI
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID
   INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER

, which is the same as the previous one.

Now, let's try the tricks we used to optimize this query in MySQL

Joining on DISTINCT:

SELECT  COUNT(*)
FROM    (
        SELECT	DISTINCT grouper
        FROM	"20090421_in".table1
        ) d
JOIN    "20090421_in".table1 t1
ON      t1.grouper = d.grouper
WHERE   d.grouper NOT IN
        (
        SELECT	id
        FROM	"20090421_in".table2 t2
        )

cnt
1000
1 row selected in 0,259 seconds

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  NESTED LOOPS
   NESTED LOOPS ANTI
    VIEW
     HASH UNIQUE
      INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER
    INDEX UNIQUE SCAN, 20090421_in, PK_TABLE2_ID
   INDEX RANGE SCAN, 20090421_in, IX_TABLE1_GROUPER

This query uses HASH UNIQUE algorithm for seleting DISTINCT grouper's from table1, which implies building a hash table over the values, discarding the duplicates.

The resultset is then ANTI JOIN'ed with table2 to produce the missing grouper's, and, finally, the missing grouper's are looked up in table1 using INDEX RANGE SCAN and counted.

As hashing the large rowset of 1,000,000 rows in both queries is the longest operation, this query is of same efficiency as the original query.

Now, let's try reverse COUNT(*):

SELECT  (
        SELECT  COUNT(*)
        FROM    "20090421_in".table1 t1
        ) -
        (
        SELECT  COUNT(*)
        FROM    "20090421_in".table1 t1
        WHERE   id IN
                (
                SELECT  id
                FROM    "20090421_in".table2 t2
                )
        )
FROM    dual

cnt
999000
1 row selected in 0,065 seconds

As with MySQL and SQL Server, this trick greatly improves the query (it runs for 65 ms, four times as fast).

The execution plan says:

SELECT STATEMENT, GOAL = ALL_ROWS
 SORT AGGREGATE
  INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID
 SORT AGGREGATE
  NESTED LOOPS
   INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID
   INDEX UNIQUE SCAN, 20090421_in, PK_TABLE1_ID
 FAST DUAL

By changing the join condition, we made possible for table2 to be leading in the join.

This plan uses NESTED LOOPS with INDEX UNIQUE SCAN on table1, which eliminates the need in costly lookups over 1,000,000 rows. The engine still needs to COUNT(*) the rows, but it's much faster than looking for actual values.

Summary: for NOT IN predicate in the tables above, Oracle uses HASH ANTI JOIN algorithm, which implies building a hash table from the results of the inner subquery.

This algorithm benefits from using INDEX FAST FULL SCAN, since no particular order is required for building and searching the hash table.

The query cannot be improved much by joining on DISTINCT grouper's, since Oracle still needs to use the large resultset for the cache lookups, but benefits from reversing the predicate and subtracting the result from the total COUNT(*).

To be continued.

Written by Quassnoi

April 21st, 2009 at 11:00 pm

Posted in Oracle

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

EXPLAIN EXTENDED

Counting missing rows: Oracle

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Counting missing rows: Oracle

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow