Oracle: OR on multiple EXISTS clauses
Comments enabled. I *really* need your comment
From Stack Overflow:
I have two queries, and I want to understand which is better in terms of performance and memory:
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_b b, tbl_a a , tbl_c c, tbl_d d WHERE ( b.id1 = a.id1 AND a.id1 = c.id1 AND UPPER(c.flag) = 'Y' AND c.id1 = d.id1 ) OR ( b.id2 = a.id2 AND a.id2 = c.id2 AND UPPER(c.flag) = 'Y' AND c.id2 = d.id2 ) AND d.id3 = 10and
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT a.id1, a.id2 FROM tbl_c c WHERE (a.id1 = c.id1 OR a.id2 = c.id2) AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_b b WHERE b.id1 = a.id1 OR b.id2 = a.id2 ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_d d WHERE (a.id1 = d.id1 or a.id2 = d.id2) AND d.id3 = 10 )The tables
tbl_b
andtbl_d
are very large tables containing 500,000 to millions of rows, while tabletbl_a
is relatively small.My requirement is to pick up only those records from table
tbl_a
, whoseid
(eitherid1
orid2
) is available intbl_b
,tbl_c
, andtbl_d
tables, satisfying certain other conditions as well.Which is best performance-wise?
We can see that both these queries contain an OR
condition, a nightmare for most optimizers.
The first query uses a join on all four tables, concatenating the results and making a distinct set out of them.
The second query check each row in tbl_a
, making sure that the corresponding records exists in other tables in one or another way.
These queries are not identical: the first query will select the rows from tbl_a
matching all tables on same id (either three matches on id1
or three matches on id2
), while the second query returns rows matching on any id
This is, if we have a row matching tbl_b
and tbl_c
on id1
and tbl_d
on id2
, this row will be returned by the second query but not the first.
Both these queries will perform poorly on large tables. However, we can improve them.
Let's create the tables, fill them with sample data and make the improvements:
CREATE TABLE tbl_a ( id1 NUMBER(10) NOT NULL, id2 NUMBER(10) NOT NULL, no VARCHAR(20) NOT NULL ) / CREATE TABLE tbl_b ( id1 NUMBER(10) NOT NULL, id2 NUMBER(10) NOT NULL, value VARCHAR(20) NOT NULL ) / CREATE TABLE tbl_c ( id1 NUMBER(10) NOT NULL, id2 NUMBER(10) NOT NULL, flag CHAR(1) NOT NULL, value VARCHAR(20) NOT NULL ) / CREATE TABLE tbl_d ( id1 NUMBER(10) NOT NULL, id2 NUMBER(10) NOT NULL, id3 NUMBER(10) NOT NULL, value VARCHAR(20) NOT NULL ) / CREATE INDEX ix_b_1 ON tbl_b (id1) / CREATE INDEX ix_b_2 ON tbl_b (id2) / CREATE INDEX ix_c_1flag ON tbl_c (id1, UPPER(flag)) / CREATE INDEX ix_c_2flag ON tbl_c (id2, UPPER(flag)) / CREATE INDEX ix_d_13 ON tbl_d (id1, id3) / CREATE INDEX ix_d_23 ON tbl_d (id2, id3) / BEGIN DBMS_RANDOM.seed(20090715); END; / INSERT INTO tbl_a (id1, id2, no) SELECT level, level + 100, 'No ' || level FROM dual CONNECT BY level <= 100 / INSERT INTO tbl_b (id1, id2, value) SELECT MOD(level, 10) + 1, MOD(level, 10) + 101, 'Value ' || level FROM dual CONNECT BY level <= 1000000 / INSERT INTO tbl_c (id1, id2, flag, value) SELECT MOD(level, 10) + 1, MOD(level, 10) + 101, DECODE(MOD(level, 100), 0, 'N', 1, 'y', 'Y'), 'Value ' || level FROM dual CONNECT BY level <= 1000000 / INSERT INTO tbl_d (id1, id2, id3, value) SELECT MOD(level, 10) + 1, MOD(level, 10) + 101, MOD(level, 300), 'Value ' || level FROM dual CONNECT BY level <= 1000000 / COMMIT / [/sourcecode] There are <strong>100</strong> rows in <code>tbl_a</code>, and <strong>1,000,000</strong> in <code>tbl_b</code>, <code>tbl_c</code> and <code>tbl_d</code>. The field combinations we are joining the tables on are indexed: CREATE INDEX ix_b_1 ON tbl_b (id1) / CREATE INDEX ix_b_2 ON tbl_b (id2) / CREATE INDEX ix_c_1flag ON tbl_c (id1, UPPER(flag)) / CREATE INDEX ix_c_2flag ON tbl_c (id2, UPPER(flag)) / CREATE INDEX ix_d_13 ON tbl_d (id1, id3) / CREATE INDEX ix_d_23 ON tbl_d (id2, id3) /
Note the function based index on tbl_c (id2, UPPER(flag))
.
Oracle optimizer cannot efficiently handle the OR
conditions in both above queries.
That's why it users intermediate HASH JOIN
s (which yield billions of rows) and then tries to DISTINCT
them using HASH UNIQUE
.
The first query I had to cancel after 3hours :
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_b b, tbl_a a , tbl_c c, tbl_d d WHERE ( b.id1 = a.id1 AND a.id1 = c.id1 AND UPPER(c.flag) = 'Y' AND c.id1 = d.id1 ) OR ( b.id2 = a.id2 AND a.id2 = c.id2 AND UPPER(c.flag) = 'Y' AND c.id2 = d.id2 ) AND d.id3 = 10
SELECT STATEMENT HASH UNIQUE CONCATENATION HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_D HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_C TABLE ACCESS FULL, 20090715_or.TBL_B HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_D HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_C TABLE ACCESS FULL, 20090715_or.TBL_B
As you can see, it uses a stack of HASH JOIN
s (which produce a very large intermediate resultset), and then tries to shrink this resultset using HASH UNIQUE
.
The second query does the same, but produces yet larger intermediate resultset, since it has to check for 7 possible combinations (instead of 2):
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT a.id1, a.id2 FROM tbl_c c WHERE (a.id1 = c.id1 OR a.id2 = c.id2) AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_b b WHERE b.id1 = a.id1 OR b.id2 = a.id2 ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_d d WHERE (a.id1 = d.id1 or a.id2 = d.id2) AND d.id3 = 10 )
SELECT STATEMENT HASH UNIQUE CONCATENATION HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_B HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A MERGE JOIN CARTESIAN TABLE ACCESS FULL, 20090715_or.TBL_D BUFFER SORT TABLE ACCESS FULL, 20090715_or.TBL_C HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_D HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_B TABLE ACCESS FULL, 20090715_or.TBL_C HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_C HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_D TABLE ACCESS FULL, 20090715_or.TBL_B HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_D HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_C TABLE ACCESS FULL, 20090715_or.TBL_B HASH JOIN HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_D TABLE ACCESS FULL, 20090715_or.TBL_C TABLE ACCESS FULL, 20090715_or.TBL_B HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_D HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_C TABLE ACCESS FULL, 20090715_or.TBL_B HASH JOIN HASH JOIN HASH JOIN TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_D TABLE ACCESS FULL, 20090715_or.TBL_C TABLE ACCESS FULL, 20090715_or.TBL_B
How can we improve this query?
To do this, we should get rid of all OR
conditions inside the subqueries.
This can be done using UNION
.
To rewrite the first query, we should find all rows from tbl_a
that have matches on id1
in all three tables, then do the same for id2
and UNION
the results. Since there is a single equijoin condition for each subquery, Oracle's optimizer can employ efficient SEMI JOIN
algorithms, and since the tbl_a
is small, merging the results will be almost instant:
SELECT a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT 1 FROM tbl_b b WHERE b.id1 = a.id1 ) AND EXISTS ( SELECT 1 FROM tbl_c c WHERE c.id1 = a.id1 AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT 1 FROM tbl_d d WHERE d.id1 = a.id1 AND d.id3 = 10 ) UNION SELECT a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT 1 FROM tbl_b b WHERE b.id2 = a.id2 ) AND EXISTS ( SELECT 1 FROM tbl_c c WHERE c.id2 = a.id2 AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT 1 FROM tbl_d d WHERE d.id2 = a.id2 AND d.id3 = 10 )
NO | ID1 | ID2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No 1 | 1 | 101 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0085s) |
SELECT STATEMENT SORT UNIQUE UNION-ALL NESTED LOOPS SEMI NESTED LOOPS SEMI HASH JOIN SEMI TABLE ACCESS FULL, 20090715_or.TBL_A INDEX SKIP SCAN, 20090715_or.IX_D_13 INDEX RANGE SCAN, 20090715_or.IX_B_1 INDEX RANGE SCAN, 20090715_or.IX_C_1FLAG NESTED LOOPS SEMI NESTED LOOPS SEMI HASH JOIN SEMI TABLE ACCESS FULL, 20090715_or.TBL_A INDEX SKIP SCAN, 20090715_or.IX_D_23 INDEX RANGE SCAN, 20090715_or.IX_B_2 INDEX RANGE SCAN, 20090715_or.IX_C_2FLAG
Each of two subqueries use a semijoin method on corresponding indexes, and the whole query completes in but 80 ms, which is almost instant.
Note that were there no indexes on the tables, this would still be quite efficient, since in this case HASH SEMI
operations wouldn't increase the intermediate resultset but decrease it:
SELECT /*+ FULL(a) */ a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT /*+ FULL(b) */ 1 FROM tbl_b b WHERE b.id1 = a.id1 ) AND EXISTS ( SELECT /*+ FULL(c) */ 1 FROM tbl_c c WHERE c.id1 = a.id1 AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT /*+ FULL(d) */ 1 FROM tbl_d d WHERE d.id1 = a.id1 AND d.id3 = 10 ) UNION SELECT /*+ FULL(a) */ a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT /*+ FULL(b) */ 1 FROM tbl_b b WHERE b.id2 = a.id2 ) AND EXISTS ( SELECT /*+ FULL(c) */ 1 FROM tbl_c c WHERE c.id2 = a.id2 AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT /*+ FULL(d) */ 1 FROM tbl_d d WHERE d.id2 = a.id2 AND d.id3 = 10 )
NO | ID1 | ID2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No 1 | 1 | 101 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (1.5625s) |
SELECT STATEMENT SORT UNIQUE UNION-ALL HASH JOIN SEMI HASH JOIN SEMI HASH JOIN SEMI TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_D TABLE ACCESS FULL, 20090715_or.TBL_B TABLE ACCESS FULL, 20090715_or.TBL_C HASH JOIN SEMI HASH JOIN SEMI HASH JOIN SEMI TABLE ACCESS FULL, 20090715_or.TBL_A TABLE ACCESS FULL, 20090715_or.TBL_D TABLE ACCESS FULL, 20090715_or.TBL_B TABLE ACCESS FULL, 20090715_or.TBL_C
I added hints to disable indexes, but the query still completes in 1.56 seconds
Now, let's try to emulate the second query, which allows matching different tables on different id
s.
To do this, we can employ a simple filtering technique.
Since all our tables are properly indexed, we can just use a logical combination of different EXISTS
conditions (as opposed to pushing the logical conditions into EXISTS
predicates themselves).
Each single EXISTS
condition requires but one index scan operation, and the whole query, therefore, will take but 1200 index lookups, which is nothing.
SELECT a.no, a.id1, a.id2 FROM "20090715_or".tbl_a a WHERE ( EXISTS ( SELECT 1 FROM "20090715_or".tbl_b b WHERE b.id1 = a.id1 ) OR EXISTS ( SELECT 1 FROM "20090715_or".tbl_b b WHERE b.id2 = a.id2 ) ) AND ( EXISTS ( SELECT 1 FROM "20090715_or".tbl_c c WHERE c.id1 = a.id1 AND UPPER(c.flag) = 'Y' ) OR EXISTS ( SELECT 1 FROM "20090715_or".tbl_c c WHERE c.id2 = a.id2 AND UPPER(c.flag) = 'Y' ) ) AND ( EXISTS ( SELECT 1 FROM "20090715_or".tbl_d d WHERE d.id2 = a.id2 AND d.id3 = 10 ) OR EXISTS ( SELECT 1 FROM "20090715_or".tbl_d d WHERE d.id2 = a.id2 AND d.id3 = 10 ) )
NO | ID1 | ID2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No 1 | 1 | 101 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0018s) |
SELECT STATEMENT FILTER TABLE ACCESS FULL, 20090715_or.TBL_A INDEX RANGE SCAN, 20090715_or.IX_B_1 INDEX RANGE SCAN, 20090715_or.IX_B_2 INDEX RANGE SCAN, 20090715_or.IX_C_1FLAG INDEX RANGE SCAN, 20090715_or.IX_C_2FLAG INDEX RANGE SCAN, 20090715_or.IX_D_23 INDEX RANGE SCAN, 20090715_or.IX_D_23
18 ms, which is literally nothing.
This solution can be applied to the first query as well, and probably will be as efficient in the first case too.
I just wanted to demonstrate that transforming an OR
query to a UNION
of AND
queries can help the optimizer to choose right joining scenario for each separate query and boost overall query performance.