Oracle: OR on multiple EXISTS clauses
Comments enabled. I *really* need your comment
From Stack Overflow:
I have two queries, and I want to understand which is better in terms of performance and memory:
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_b b, tbl_a a , tbl_c c, tbl_d d WHERE ( b.id1 = a.id1 AND a.id1 = c.id1 AND UPPER(c.flag) = 'Y' AND c.id1 = d.id1 ) OR ( b.id2 = a.id2 AND a.id2 = c.id2 AND UPPER(c.flag) = 'Y' AND c.id2 = d.id2 ) AND d.id3 = 10and
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT a.id1, a.id2 FROM tbl_c c WHERE (a.id1 = c.id1 OR a.id2 = c.id2) AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_b b WHERE b.id1 = a.id1 OR b.id2 = a.id2 ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_d d WHERE (a.id1 = d.id1 or a.id2 = d.id2) AND d.id3 = 10 )The tables
tbl_bandtbl_dare very large tables containing 500,000 to millions of rows, while tabletbl_ais relatively small.My requirement is to pick up only those records from table
tbl_a, whoseid(eitherid1orid2) is available intbl_b,tbl_c, andtbl_dtables, satisfying certain other conditions as well.Which is best performance-wise?
We can see that both these queries contain an OR condition, a nightmare for most optimizers.
The first query uses a join on all four tables, concatenating the results and making a distinct set out of them.
The second query check each row in tbl_a, making sure that the corresponding records exists in other tables in one or another way.
These queries are not identical: the first query will select the rows from tbl_a matching all tables on same id (either three matches on id1 or three matches on id2), while the second query returns rows matching on any id
This is, if we have a row matching tbl_b and tbl_c on id1 and tbl_d on id2, this row will be returned by the second query but not the first.
Both these queries will perform poorly on large tables. However, we can improve them.
Let's create the tables, fill them with sample data and make the improvements:
CREATE TABLE tbl_a
(
id1 NUMBER(10) NOT NULL,
id2 NUMBER(10) NOT NULL,
no VARCHAR(20) NOT NULL
)
/
CREATE TABLE tbl_b
(
id1 NUMBER(10) NOT NULL,
id2 NUMBER(10) NOT NULL,
value VARCHAR(20) NOT NULL
)
/
CREATE TABLE tbl_c
(
id1 NUMBER(10) NOT NULL,
id2 NUMBER(10) NOT NULL,
flag CHAR(1) NOT NULL,
value VARCHAR(20) NOT NULL
)
/
CREATE TABLE tbl_d
(
id1 NUMBER(10) NOT NULL,
id2 NUMBER(10) NOT NULL,
id3 NUMBER(10) NOT NULL,
value VARCHAR(20) NOT NULL
)
/
CREATE INDEX ix_b_1 ON tbl_b (id1)
/
CREATE INDEX ix_b_2 ON tbl_b (id2)
/
CREATE INDEX ix_c_1flag ON tbl_c (id1, UPPER(flag))
/
CREATE INDEX ix_c_2flag ON tbl_c (id2, UPPER(flag))
/
CREATE INDEX ix_d_13 ON tbl_d (id1, id3)
/
CREATE INDEX ix_d_23 ON tbl_d (id2, id3)
/
BEGIN
DBMS_RANDOM.seed(20090715);
END;
/
INSERT
INTO tbl_a (id1, id2, no)
SELECT level, level + 100, 'No ' || level
FROM dual
CONNECT BY
level <= 100
/
INSERT
INTO tbl_b (id1, id2, value)
SELECT MOD(level, 10) + 1, MOD(level, 10) + 101, 'Value ' || level
FROM dual
CONNECT BY
level <= 1000000
/
INSERT
INTO tbl_c (id1, id2, flag, value)
SELECT MOD(level, 10) + 1, MOD(level, 10) + 101, DECODE(MOD(level, 100), 0, 'N', 1, 'y', 'Y'), 'Value ' || level
FROM dual
CONNECT BY
level <= 1000000
/
INSERT
INTO tbl_d (id1, id2, id3, value)
SELECT MOD(level, 10) + 1, MOD(level, 10) + 101, MOD(level, 300), 'Value ' || level
FROM dual
CONNECT BY
level <= 1000000
/
COMMIT
/
[/sourcecode]
There are <strong>100</strong> rows in <code>tbl_a</code>, and <strong>1,000,000</strong> in <code>tbl_b</code>, <code>tbl_c</code> and <code>tbl_d</code>.
The field combinations we are joining the tables on are indexed:
CREATE INDEX ix_b_1 ON tbl_b (id1)
/
CREATE INDEX ix_b_2 ON tbl_b (id2)
/
CREATE INDEX ix_c_1flag ON tbl_c (id1, UPPER(flag))
/
CREATE INDEX ix_c_2flag ON tbl_c (id2, UPPER(flag))
/
CREATE INDEX ix_d_13 ON tbl_d (id1, id3)
/
CREATE INDEX ix_d_23 ON tbl_d (id2, id3)
/
Note the function based index on tbl_c (id2, UPPER(flag)).
Oracle optimizer cannot efficiently handle the OR conditions in both above queries.
That's why it users intermediate HASH JOINs (which yield billions of rows) and then tries to DISTINCT them using HASH UNIQUE.
The first query I had to cancel after 3hours :
SELECT DISTINCT
a.no,
a.id1,
a.id2
FROM tbl_b b,
tbl_a a ,
tbl_c c,
tbl_d d
WHERE (
b.id1 = a.id1
AND a.id1 = c.id1
AND UPPER(c.flag) = 'Y'
AND c.id1 = d.id1
)
OR
(
b.id2 = a.id2
AND a.id2 = c.id2
AND UPPER(c.flag) = 'Y'
AND c.id2 = d.id2
)
AND d.id3 = 10
SELECT STATEMENT
HASH UNIQUE
CONCATENATION
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_D
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_C
TABLE ACCESS FULL, 20090715_or.TBL_B
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_D
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_C
TABLE ACCESS FULL, 20090715_or.TBL_B
As you can see, it uses a stack of HASH JOINs (which produce a very large intermediate resultset), and then tries to shrink this resultset using HASH UNIQUE.
The second query does the same, but produces yet larger intermediate resultset, since it has to check for 7 possible combinations (instead of 2):
SELECT DISTINCT
a.no,
a.id1,
a.id2
FROM tbl_a a
WHERE EXISTS
(
SELECT a.id1, a.id2
FROM tbl_c c
WHERE (a.id1 = c.id1 OR a.id2 = c.id2)
AND UPPER(c.flag) = 'Y'
)
AND EXISTS
(
SELECT a.id1, a.id2
FROM tbl_b b
WHERE b.id1 = a.id1 OR b.id2 = a.id2
)
AND EXISTS
(
SELECT a.id1, a.id2
FROM tbl_d d
WHERE (a.id1 = d.id1 or a.id2 = d.id2)
AND d.id3 = 10
)
SELECT STATEMENT
HASH UNIQUE
CONCATENATION
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_B
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
MERGE JOIN CARTESIAN
TABLE ACCESS FULL, 20090715_or.TBL_D
BUFFER SORT
TABLE ACCESS FULL, 20090715_or.TBL_C
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_D
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_B
TABLE ACCESS FULL, 20090715_or.TBL_C
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_C
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_D
TABLE ACCESS FULL, 20090715_or.TBL_B
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_D
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_C
TABLE ACCESS FULL, 20090715_or.TBL_B
HASH JOIN
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_D
TABLE ACCESS FULL, 20090715_or.TBL_C
TABLE ACCESS FULL, 20090715_or.TBL_B
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_D
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_C
TABLE ACCESS FULL, 20090715_or.TBL_B
HASH JOIN
HASH JOIN
HASH JOIN
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_D
TABLE ACCESS FULL, 20090715_or.TBL_C
TABLE ACCESS FULL, 20090715_or.TBL_B
How can we improve this query?
To do this, we should get rid of all OR conditions inside the subqueries.
This can be done using UNION.
To rewrite the first query, we should find all rows from tbl_a that have matches on id1 in all three tables, then do the same for id2 and UNION the results. Since there is a single equijoin condition for each subquery, Oracle's optimizer can employ efficient SEMI JOIN algorithms, and since the tbl_a is small, merging the results will be almost instant:
SELECT a.no,
a.id1,
a.id2
FROM tbl_a a
WHERE EXISTS
(
SELECT 1
FROM tbl_b b
WHERE b.id1 = a.id1
)
AND EXISTS
(
SELECT 1
FROM tbl_c c
WHERE c.id1 = a.id1
AND UPPER(c.flag) = 'Y'
)
AND EXISTS
(
SELECT 1
FROM tbl_d d
WHERE d.id1 = a.id1
AND d.id3 = 10
)
UNION
SELECT a.no,
a.id1,
a.id2
FROM tbl_a a
WHERE EXISTS
(
SELECT 1
FROM tbl_b b
WHERE b.id2 = a.id2
)
AND EXISTS
(
SELECT 1
FROM tbl_c c
WHERE c.id2 = a.id2
AND UPPER(c.flag) = 'Y'
)
AND EXISTS
(
SELECT 1
FROM tbl_d d
WHERE d.id2 = a.id2
AND d.id3 = 10
)
| NO | ID1 | ID2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No 1 | 1 | 101 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0002s (0.0085s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SELECT STATEMENT
SORT UNIQUE
UNION-ALL
NESTED LOOPS SEMI
NESTED LOOPS SEMI
HASH JOIN SEMI
TABLE ACCESS FULL, 20090715_or.TBL_A
INDEX SKIP SCAN, 20090715_or.IX_D_13
INDEX RANGE SCAN, 20090715_or.IX_B_1
INDEX RANGE SCAN, 20090715_or.IX_C_1FLAG
NESTED LOOPS SEMI
NESTED LOOPS SEMI
HASH JOIN SEMI
TABLE ACCESS FULL, 20090715_or.TBL_A
INDEX SKIP SCAN, 20090715_or.IX_D_23
INDEX RANGE SCAN, 20090715_or.IX_B_2
INDEX RANGE SCAN, 20090715_or.IX_C_2FLAG
Each of two subqueries use a semijoin method on corresponding indexes, and the whole query completes in but 80 ms, which is almost instant.
Note that were there no indexes on the tables, this would still be quite efficient, since in this case HASH SEMI operations wouldn't increase the intermediate resultset but decrease it:
SELECT /*+ FULL(a) */
a.no,
a.id1,
a.id2
FROM tbl_a a
WHERE EXISTS
(
SELECT /*+ FULL(b) */
1
FROM tbl_b b
WHERE b.id1 = a.id1
)
AND EXISTS
(
SELECT /*+ FULL(c) */
1
FROM tbl_c c
WHERE c.id1 = a.id1
AND UPPER(c.flag) = 'Y'
)
AND EXISTS
(
SELECT /*+ FULL(d) */
1
FROM tbl_d d
WHERE d.id1 = a.id1
AND d.id3 = 10
)
UNION
SELECT /*+ FULL(a) */
a.no,
a.id1,
a.id2
FROM tbl_a a
WHERE EXISTS
(
SELECT /*+ FULL(b) */
1
FROM tbl_b b
WHERE b.id2 = a.id2
)
AND EXISTS
(
SELECT /*+ FULL(c) */
1
FROM tbl_c c
WHERE c.id2 = a.id2
AND UPPER(c.flag) = 'Y'
)
AND EXISTS
(
SELECT /*+ FULL(d) */
1
FROM tbl_d d
WHERE d.id2 = a.id2
AND d.id3 = 10
)
| NO | ID1 | ID2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No 1 | 1 | 101 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0002s (1.5625s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SELECT STATEMENT
SORT UNIQUE
UNION-ALL
HASH JOIN SEMI
HASH JOIN SEMI
HASH JOIN SEMI
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_D
TABLE ACCESS FULL, 20090715_or.TBL_B
TABLE ACCESS FULL, 20090715_or.TBL_C
HASH JOIN SEMI
HASH JOIN SEMI
HASH JOIN SEMI
TABLE ACCESS FULL, 20090715_or.TBL_A
TABLE ACCESS FULL, 20090715_or.TBL_D
TABLE ACCESS FULL, 20090715_or.TBL_B
TABLE ACCESS FULL, 20090715_or.TBL_C
I added hints to disable indexes, but the query still completes in 1.56 seconds
Now, let's try to emulate the second query, which allows matching different tables on different ids.
To do this, we can employ a simple filtering technique.
Since all our tables are properly indexed, we can just use a logical combination of different EXISTS conditions (as opposed to pushing the logical conditions into EXISTS predicates themselves).
Each single EXISTS condition requires but one index scan operation, and the whole query, therefore, will take but 1200 index lookups, which is nothing.
SELECT a.no,
a.id1,
a.id2
FROM "20090715_or".tbl_a a
WHERE (
EXISTS
(
SELECT 1
FROM "20090715_or".tbl_b b
WHERE b.id1 = a.id1
)
OR EXISTS
(
SELECT 1
FROM "20090715_or".tbl_b b
WHERE b.id2 = a.id2
)
)
AND
(
EXISTS
(
SELECT 1
FROM "20090715_or".tbl_c c
WHERE c.id1 = a.id1
AND UPPER(c.flag) = 'Y'
)
OR EXISTS
(
SELECT 1
FROM "20090715_or".tbl_c c
WHERE c.id2 = a.id2
AND UPPER(c.flag) = 'Y'
)
)
AND
(
EXISTS
(
SELECT 1
FROM "20090715_or".tbl_d d
WHERE d.id2 = a.id2
AND d.id3 = 10
)
OR
EXISTS
(
SELECT 1
FROM "20090715_or".tbl_d d
WHERE d.id2 = a.id2
AND d.id3 = 10
)
)
| NO | ID1 | ID2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No 1 | 1 | 101 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0002s (0.0018s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SELECT STATEMENT
FILTER
TABLE ACCESS FULL, 20090715_or.TBL_A
INDEX RANGE SCAN, 20090715_or.IX_B_1
INDEX RANGE SCAN, 20090715_or.IX_B_2
INDEX RANGE SCAN, 20090715_or.IX_C_1FLAG
INDEX RANGE SCAN, 20090715_or.IX_C_2FLAG
INDEX RANGE SCAN, 20090715_or.IX_D_23
INDEX RANGE SCAN, 20090715_or.IX_D_23
18 ms, which is literally nothing.
This solution can be applied to the first query as well, and probably will be as efficient in the first case too.
I just wanted to demonstrate that transforming an OR query to a UNION of AND queries can help the optimizer to choose right joining scenario for each separate query and boost overall query performance.
Subscribe in a reader