Counting missing rows: Oracle
Comments enabled. I *really* need your comment
This is the 4th of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Let's see how Oracle processes this predicate.
Let's create the sample tables:
Table creation details
CREATE USER "20090421_in" IDENTIFIED BY "20090421_in" / GRANT CREATE ANY TABLE TO "20090421_in" / GRANT UNLIMITED TABLESPACE TO "20090421_in" / ALTER SESSION SET CURRENT_SCHEMA="20090421_in" / CREATE TABLE table1 ( id NUMBER(10) NOT NULL, grouper NUMBER(10) NOT NULL, value VARCHAR2(50) ) / ALTER TABLE table1 ADD CONSTRAINT pk_table1_id PRIMARY KEY (id) / CREATE INDEX ix_table1_grouper ON table1 (grouper) / CREATE TABLE table2 ( id NUMBER(10) NOT NULL, value VARCHAR2(50) ) / ALTER TABLE table2 ADD CONSTRAINT pk_table2_id PRIMARY KEY (id) / INSERT INTO table1 (id, grouper, value) SELECT level, MOD(level, 1000), 'Value One ' || level FROM dual CONNECT BY level <= 1000000 / INSERT INTO table2 (id, value) SELECT level, 'Value Two ' || level FROM dual CONNECT BY level <= 1000 / COMMIT / BEGIN DBMS_STATS.gather_schema_stats('"20090421_in"'); END; / [/sourcecode] </div> As in the previous articles, we have two tables: <code>table1</code> with <strong>1,000,000</strong> rows and <code>table2</code> with <strong>1,000</strong> rows. The original query: SELECT COUNT(*) FROM "20090421_in".table1 t1 WHERE id NOT IN ( SELECT id FROM "20090421_in".table2 t2 )
cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row selected in 0,249 seconds |
runs for 249 ms and produces the following execution plan:
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE HASH JOIN RIGHT ANTI INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID
Oracle employs an ANTI JOIN
algorithm here, but instead of MERGE ANTI JOIN
used by SQL Server, Oracle's optimizer chooses a HASH ANTI JOIN
algorithm.
This algorithm performs the following steps:
- Build a hash table over the values from the smaller dataset
- Use this hash table to look up the values from the larger dataset
As we can see, this algorithm is quite efficient, as it uses a special access path called INDEX FAST FULL SCAN
to build the rowsets.
This access path scans the index pages in order they reside in the datafiles, ignoring the index order. This is faster than traversing the index tree, uses sequential disk I/O
and can be parallelized.
Since all the values we need for the join are contained in the index itself and HASH JOIN
(unlike MERGE JOIN
) doesn't care about the order of values, this access path is exactly what we need.
If we look for the missing grouper
's:
SELECT COUNT(*) FROM "20090421_in".table1 t1 WHERE grouper NOT IN ( SELECT id FROM "20090421_in".table2 t2 )
cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row selected in 0,269 seconds |
, we get almost same execution time of 269 ms and the following plan:
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE HASH JOIN RIGHT ANTI INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER
, which is the same as the previous one.
Now, let's try the tricks we used to optimize this query in MySQL
SELECT COUNT(*) FROM ( SELECT DISTINCT grouper FROM "20090421_in".table1 ) d JOIN "20090421_in".table1 t1 ON t1.grouper = d.grouper WHERE d.grouper NOT IN ( SELECT id FROM "20090421_in".table2 t2 )
cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row selected in 0,259 seconds |
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE NESTED LOOPS NESTED LOOPS ANTI VIEW HASH UNIQUE INDEX FAST FULL SCAN, 20090421_in, IX_TABLE1_GROUPER INDEX UNIQUE SCAN, 20090421_in, PK_TABLE2_ID INDEX RANGE SCAN, 20090421_in, IX_TABLE1_GROUPER
This query uses HASH UNIQUE
algorithm for seleting DISTINCT grouper
's from table1
, which implies building a hash table over the values, discarding the duplicates.
The resultset is then ANTI JOIN
'ed with table2
to produce the missing grouper
's, and, finally, the missing grouper
's are looked up in table1
using INDEX RANGE SCAN
and counted.
As hashing the large rowset of 1,000,000 rows in both queries is the longest operation, this query is of same efficiency as the original query.
Now, let's try reverse COUNT(*)
:
SELECT ( SELECT COUNT(*) FROM "20090421_in".table1 t1 ) - ( SELECT COUNT(*) FROM "20090421_in".table1 t1 WHERE id IN ( SELECT id FROM "20090421_in".table2 t2 ) ) FROM dual
cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row selected in 0,065 seconds |
As with MySQL and SQL Server, this trick greatly improves the query (it runs for 65 ms, four times as fast).
The execution plan says:
SELECT STATEMENT, GOAL = ALL_ROWS SORT AGGREGATE INDEX FAST FULL SCAN, 20090421_in, PK_TABLE1_ID SORT AGGREGATE NESTED LOOPS INDEX FAST FULL SCAN, 20090421_in, PK_TABLE2_ID INDEX UNIQUE SCAN, 20090421_in, PK_TABLE1_ID FAST DUAL
By changing the join condition, we made possible for table2
to be leading in the join.
This plan uses NESTED LOOPS
with INDEX UNIQUE SCAN
on table1
, which eliminates the need in costly lookups over 1,000,000 rows. The engine still needs to COUNT(*)
the rows, but it's much faster than looking for actual values.
Summary: for NOT IN
predicate in the tables above, Oracle uses HASH ANTI JOIN
algorithm, which implies building a hash table from the results of the inner subquery.
This algorithm benefits from using INDEX FAST FULL SCAN
, since no particular order is required for building and searching the hash table.
The query cannot be improved much by joining on DISTINCT grouper
's, since Oracle still needs to use the large resultset for the cache lookups, but benefits from reversing the predicate and subtracting the result from the total COUNT(*)
.
To be continued.
Written by Quassnoi
April 21st, 2009 at 11:00 pm
Posted in Oracle
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009