Counting missing rows: Oracle
This is the 4th of 5 articles covering implementation of NOT IN predicate in several RDBMS‘es:
NOT INin MySQLNOT INin MySQL (usingDISTINCT)NOT INin SQL ServerNOT INin OracleNOT INin PostgreSQL
Let’s see how Oracle processes this predicate.
Let’s create the sample tables:
Table creation details
As in the previous articles, we have two tables: table1 with 1,000,000 rows and table2 with 1,000 rows.
The original query:
SELECT COUNT(*)
FROM "20090421_in".table1 t1
WHERE id NOT IN
(
SELECT id
FROM "20090421_in".table2 t2
)
| cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row selected in 0,249 seconds | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
runs for 249 ms and produces the following execution plan:
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE HASH JOIN RIGHT ANTI INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID
Oracle employs an ANTI JOIN algorithm here, but instead of MERGE ANTI JOIN used by SQL Server, Oracle‘s optimizer chooses a HASH ANTI JOIN algorithm.
This algorithm performs the following steps:
- Build a hash table over the values from the smaller dataset
- Use this hash table to look up the values from the larger dataset
As we can see, this algorithm is quite efficient, as it uses a special access path called INDEX FAST FULL SCAN to build the rowsets.
This access path scans the index pages in order they reside in the datafiles, ignoring the index order. This is faster than traversing the index tree, uses sequential disk I/O and can be parallelized.
Since all the values we need for the join are contained in the index itself and HASH JOIN (unlike MERGE JOIN) doesn’t care about the order of values, this access path is exactly what we need.
If we look for the missing grouper‘s:
SELECT COUNT(*)
FROM "20090421_in".table1 t1
WHERE grouper NOT IN
(
SELECT id
FROM "20090421_in".table2 t2
)
| cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row selected in 0,269 seconds | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
, we get almost same execution time of 269 ms and the following plan:
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE HASH JOIN RIGHT ANTI INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER
, which is the same as the previous one.
Now, let’s try the tricks we used to optimize this query in MySQL
SELECT COUNT(*)
FROM (
SELECT DISTINCT grouper
FROM "20090421_in".table1
) d
JOIN "20090421_in".table1 t1
ON t1.grouper = d.grouper
WHERE d.grouper NOT IN
(
SELECT id
FROM "20090421_in".table2 t2
)
| cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row selected in 0,259 seconds | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SELECT STATEMENT, GOAL = ALL_ROWS
SORT AGGREGATE
NESTED LOOPS
NESTED LOOPS ANTI
VIEW
HASH UNIQUE
INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER
INDEX UNIQUE SCAN, 20090421_in, PK_TABLE2_ID
INDEX RANGE SCAN, 20090421_in, IX_TABLE1_GROUPER
This query uses HASH UNIQUE algorithm for seleting DISTINCT grouper‘s from table1, which implies building a hash table over the values, discarding the duplicates.
The resultset is then ANTI JOIN‘ed with table2 to produce the missing grouper‘s, and, finally, the missing grouper‘s are looked up in table1 using INDEX RANGE SCAN and counted.
As hashing the large rowset of 1,000,000 rows in both queries is the longest operation, this query is of same efficiency as the original query.
Now, let’s try reverse COUNT(*):
SELECT (
SELECT COUNT(*)
FROM "20090421_in".table1 t1
) -
(
SELECT COUNT(*)
FROM "20090421_in".table1 t1
WHERE id IN
(
SELECT id
FROM "20090421_in".table2 t2
)
)
FROM dual
| cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row selected in 0,065 seconds | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As with MySQL and SQL Server, this trick greatly improves the query (it runs for 65 ms, four times as fast).
The execution plan says:
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID SORT AGGREGATE NESTED LOOPS INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID INDEX UNIQUE SCAN, 20090421_in, PK_TABLE1_ID FAST DUAL
By changing the join condition, we made possible for table2 to be leading in the join.
This plan uses NESTED LOOPS with INDEX UNIQUE SCAN on table1, which eliminates the need in costly lookups over 1,000,000 rows. The engine still needs to COUNT(*) the rows, but it’s much faster than looking for actual values.
Summary: for NOT IN predicate in the tables above, Oracle uses HASH ANTI JOIN algorithm, which implies building a hash table from the results of the inner subquery.
This algorithm benefits from using INDEX FAST FULL SCAN, since no particular order is required for building and searching the hash table.
The query cannot be improved much by joining on DISTINCT grouper‘s, since Oracle still needs to use the large resultset for the cache lookups, but benefits from reversing the predicate and subtracting the result from the total COUNT(*).
To be continued.
Subscribe in a reader