Oracle: OR on multiple EXISTS clauses at EXPLAIN EXTENDED

I have two queries, and I want to understand which is better in terms of performance and memory:
SELECT  DISTINCT
        a.no,
        a.id1,
        a.id2
FROM    tbl_b b,
        tbl_a a ,
        tbl_c c,
        tbl_d d
WHERE   (
        b.id1 = a.id1
        AND a.id1 = c.id1
        AND UPPER(c.flag) = 'Y'
        AND c.id1 = d.id1
        )
        OR
        (
        b.id2 = a.id2
        AND a.id2 = c.id2
        AND UPPER(c.flag) = 'Y'
        AND c.id2 = d.id2
        )
        AND d.id3 = 10
and
SELECT  DISTINCT
        a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_c c
        WHERE   (a.id1 = c.id1 OR a.id2 = c.id2)
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_b b
        WHERE   b.id1 = a.id1 OR b.id2 = a.id2
        )
        AND EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_d d
        WHERE   (a.id1 = d.id1 or a.id2 = d.id2)
                AND d.id3 = 10
        )
The tables tbl_b and tbl_d are very large tables containing 500,000 to millions of rows, while table tbl_a is relatively small.

My requirement is to pick up only those records from table tbl_a, whose id (either id1 or id2) is available in tbl_b, tbl_c, and tbl_d tables, satisfying certain other conditions as well.

Which is best performance-wise?

We can see that both these queries contain an OR condition, a nightmare for most optimizers.

The first query uses a join on all four tables, concatenating the results and making a distinct set out of them.

The second query check each row in tbl_a, making sure that the corresponding records exists in other tables in one or another way.

These queries are not identical: the first query will select the rows from tbl_a matching all tables on same id (either three matches on id1 or three matches on id2), while the second query returns rows matching on any id

This is, if we have a row matching tbl_b and tbl_c on id1 and tbl_d on id2, this row will be returned by the second query but not the first.

Both these queries will perform poorly on large tables. However, we can improve them.

Let's create the tables, fill them with sample data and make the improvements:

CREATE TABLE tbl_a
        (
        id1 NUMBER(10) NOT NULL,
        id2 NUMBER(10) NOT NULL,
        no VARCHAR(20) NOT NULL
        )
/
CREATE TABLE tbl_b
        (
        id1 NUMBER(10) NOT NULL,
        id2 NUMBER(10) NOT NULL,
        value VARCHAR(20) NOT NULL
        )
/
CREATE TABLE tbl_c
        (
        id1 NUMBER(10) NOT NULL,
        id2 NUMBER(10) NOT NULL,
        flag CHAR(1) NOT NULL,
        value VARCHAR(20) NOT NULL
        )
/
CREATE TABLE tbl_d
        (
        id1 NUMBER(10) NOT NULL,
        id2 NUMBER(10) NOT NULL,
        id3 NUMBER(10) NOT NULL,
        value VARCHAR(20) NOT NULL
        )
/
CREATE INDEX ix_b_1 ON tbl_b (id1)
/
CREATE INDEX ix_b_2 ON tbl_b (id2)
/
CREATE INDEX ix_c_1flag ON tbl_c (id1, UPPER(flag))
/
CREATE INDEX ix_c_2flag ON tbl_c (id2, UPPER(flag))
/
CREATE INDEX ix_d_13 ON tbl_d (id1, id3)
/
CREATE INDEX ix_d_23 ON tbl_d (id2, id3)
/
BEGIN
        DBMS_RANDOM.seed(20090715);
END;
/
INSERT
INTO    tbl_a (id1, id2, no)
SELECT  level, level + 100, 'No ' || level
FROM    dual
CONNECT BY
        level <= 100
/
INSERT
INTO    tbl_b (id1, id2, value)
SELECT  MOD(level, 10) + 1, MOD(level, 10) + 101, 'Value ' || level
FROM    dual
CONNECT BY
        level <= 1000000
/
INSERT
INTO    tbl_c (id1, id2, flag, value)
SELECT  MOD(level, 10) + 1, MOD(level, 10) + 101, DECODE(MOD(level, 100), 0, 'N', 1, 'y', 'Y'), 'Value ' || level
FROM    dual
CONNECT BY
        level <= 1000000
/
INSERT      
INTO    tbl_d (id1, id2, id3, value)
SELECT  MOD(level, 10) + 1, MOD(level, 10) + 101, MOD(level, 300), 'Value ' || level
FROM    dual
CONNECT BY
        level <= 1000000
/
COMMIT
/
&#91;/sourcecode&#93;

There are <strong>100</strong> rows in <code>tbl_a</code>, and <strong>1,000,000</strong> in <code>tbl_b</code>, <code>tbl_c</code> and <code>tbl_d</code>.

The field combinations we are joining the tables on are indexed:


CREATE INDEX ix_b_1 ON tbl_b (id1)
/
CREATE INDEX ix_b_2 ON tbl_b (id2)
/
CREATE INDEX ix_c_1flag ON tbl_c (id1, UPPER(flag))
/
CREATE INDEX ix_c_2flag ON tbl_c (id2, UPPER(flag))
/
CREATE INDEX ix_d_13 ON tbl_d (id1, id3)
/
CREATE INDEX ix_d_23 ON tbl_d (id2, id3)
/

Note the function based index on tbl_c (id2, UPPER(flag)).

Oracle optimizer cannot efficiently handle the OR conditions in both above queries.

That's why it users intermediate HASH JOINs (which yield billions of rows) and then tries to DISTINCT them using HASH UNIQUE.

The first query I had to cancel after 3hours :

SELECT  DISTINCT
        a.no,
        a.id1,
        a.id2
FROM    tbl_b b,
        tbl_a a ,
        tbl_c c,
        tbl_d d
WHERE   (
        b.id1 = a.id1
        AND a.id1 = c.id1
        AND UPPER(c.flag) = 'Y'
        AND c.id1 = d.id1
        )
        OR
        (
        b.id2 = a.id2
        AND a.id2 = c.id2
        AND UPPER(c.flag) = 'Y'
        AND c.id2 = d.id2
        )
        AND d.id3 = 10

SELECT STATEMENT 
 HASH UNIQUE
  CONCATENATION 
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_D
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_C
     TABLE ACCESS FULL, 20090715_or.TBL_B
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_D
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_C
     TABLE ACCESS FULL, 20090715_or.TBL_B

As you can see, it uses a stack of HASH JOINs (which produce a very large intermediate resultset), and then tries to shrink this resultset using HASH UNIQUE.

The second query does the same, but produces yet larger intermediate resultset, since it has to check for 7 possible combinations (instead of 2):

SELECT  DISTINCT
        a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_c c
        WHERE   (a.id1 = c.id1 OR a.id2 = c.id2)
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_b b
        WHERE   b.id1 = a.id1 OR b.id2 = a.id2
        )
        AND EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_d d
        WHERE   (a.id1 = d.id1 or a.id2 = d.id2)
                AND d.id3 = 10
        )

SELECT STATEMENT 
 HASH UNIQUE
  CONCATENATION 
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_B
    HASH JOIN 
     TABLE ACCESS FULL, 20090715_or.TBL_A
     MERGE JOIN CARTESIAN
      TABLE ACCESS FULL, 20090715_or.TBL_D
      BUFFER SORT
       TABLE ACCESS FULL, 20090715_or.TBL_C
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_D
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_B
     TABLE ACCESS FULL, 20090715_or.TBL_C
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_C
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_D
     TABLE ACCESS FULL, 20090715_or.TBL_B
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_D
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_C
     TABLE ACCESS FULL, 20090715_or.TBL_B
   HASH JOIN 
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_D
     TABLE ACCESS FULL, 20090715_or.TBL_C
    TABLE ACCESS FULL, 20090715_or.TBL_B
   HASH JOIN 
    TABLE ACCESS FULL, 20090715_or.TBL_D
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_C
     TABLE ACCESS FULL, 20090715_or.TBL_B
   HASH JOIN 
    HASH JOIN 
     HASH JOIN 
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_D
     TABLE ACCESS FULL, 20090715_or.TBL_C
    TABLE ACCESS FULL, 20090715_or.TBL_B

How can we improve this query?

To do this, we should get rid of all OR conditions inside the subqueries.

This can be done using UNION.

To rewrite the first query, we should find all rows from tbl_a that have matches on id1 in all three tables, then do the same for id2 and UNION the results. Since there is a single equijoin condition for each subquery, Oracle's optimizer can employ efficient SEMI JOIN algorithms, and since the tbl_a is small, merging the results will be almost instant:

SELECT  a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  1
        FROM    tbl_b b
        WHERE   b.id1 = a.id1
        )
        AND EXISTS
        (
        SELECT  1
        FROM    tbl_c c
        WHERE   c.id1 = a.id1
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  1
        FROM    tbl_d d
        WHERE   d.id1 = a.id1
                AND d.id3 = 10
        )
UNION
SELECT  a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  1
        FROM    tbl_b b
        WHERE   b.id2 = a.id2
        )
        AND EXISTS
        (
        SELECT  1
        FROM    tbl_c c
        WHERE   c.id2 = a.id2
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  1
        FROM    tbl_d d
        WHERE   d.id2 = a.id2
                AND d.id3 = 10
        )

NO	ID1	ID2
No 1	1	101
1 row fetched in 0.0002s (0.0085s)

SELECT STATEMENT 
 SORT UNIQUE
  UNION-ALL 
   NESTED LOOPS SEMI
    NESTED LOOPS SEMI
     HASH JOIN SEMI
      TABLE ACCESS FULL, 20090715_or.TBL_A
      INDEX SKIP SCAN, 20090715_or.IX_D_13
     INDEX RANGE SCAN, 20090715_or.IX_B_1
    INDEX RANGE SCAN, 20090715_or.IX_C_1FLAG
   NESTED LOOPS SEMI
    NESTED LOOPS SEMI
     HASH JOIN SEMI
      TABLE ACCESS FULL, 20090715_or.TBL_A
      INDEX SKIP SCAN, 20090715_or.IX_D_23
     INDEX RANGE SCAN, 20090715_or.IX_B_2
    INDEX RANGE SCAN, 20090715_or.IX_C_2FLAG

Each of two subqueries use a semijoin method on corresponding indexes, and the whole query completes in but 80 ms, which is almost instant.

Note that were there no indexes on the tables, this would still be quite efficient, since in this case HASH SEMI operations wouldn't increase the intermediate resultset but decrease it:

SELECT  /*+ FULL(a) */
        a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  /*+ FULL(b) */
                1
        FROM    tbl_b b
        WHERE   b.id1 = a.id1
        )
        AND EXISTS
        (
        SELECT  /*+ FULL(c) */
                1
        FROM    tbl_c c
        WHERE   c.id1 = a.id1
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  /*+ FULL(d) */
                1
        FROM    tbl_d d
        WHERE   d.id1 = a.id1
                AND d.id3 = 10
        )
UNION
SELECT  /*+ FULL(a) */
        a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  /*+ FULL(b) */
                1
        FROM    tbl_b b
        WHERE   b.id2 = a.id2
        )
        AND EXISTS
        (
        SELECT  /*+ FULL(c) */
                1
        FROM    tbl_c c
        WHERE   c.id2 = a.id2
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  /*+ FULL(d) */
                1
        FROM    tbl_d d
        WHERE   d.id2 = a.id2
                AND d.id3 = 10
        )

NO	ID1	ID2
No 1	1	101
1 row fetched in 0.0002s (1.5625s)

SELECT STATEMENT 
 SORT UNIQUE
  UNION-ALL 
   HASH JOIN SEMI
    HASH JOIN SEMI
     HASH JOIN SEMI
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_D
     TABLE ACCESS FULL, 20090715_or.TBL_B
    TABLE ACCESS FULL, 20090715_or.TBL_C
   HASH JOIN SEMI
    HASH JOIN SEMI
     HASH JOIN SEMI
      TABLE ACCESS FULL, 20090715_or.TBL_A
      TABLE ACCESS FULL, 20090715_or.TBL_D
     TABLE ACCESS FULL, 20090715_or.TBL_B
    TABLE ACCESS FULL, 20090715_or.TBL_C

I added hints to disable indexes, but the query still completes in 1.56 seconds

Now, let's try to emulate the second query, which allows matching different tables on different ids.

To do this, we can employ a simple filtering technique.

Since all our tables are properly indexed, we can just use a logical combination of different EXISTS conditions (as opposed to pushing the logical conditions into EXISTS predicates themselves).

Each single EXISTS condition requires but one index scan operation, and the whole query, therefore, will take but 1200 index lookups, which is nothing.

SELECT  a.no,
        a.id1,
        a.id2
FROM    "20090715_or".tbl_a a
WHERE   (
                EXISTS
                (
                SELECT  1
                FROM    "20090715_or".tbl_b b
                WHERE   b.id1 = a.id1
                )
                OR EXISTS
                (
                SELECT  1
                FROM    "20090715_or".tbl_b b
                WHERE   b.id2 = a.id2
                )
        )
        AND
        (
                EXISTS
                (
                SELECT  1
                FROM    "20090715_or".tbl_c c
                WHERE   c.id1 = a.id1
                        AND UPPER(c.flag) = 'Y'
                )
                OR EXISTS
                (
                SELECT  1
                FROM    "20090715_or".tbl_c c
                WHERE   c.id2 = a.id2
                        AND UPPER(c.flag) = 'Y'
                )
        )
        AND
        (
                EXISTS
                (
                SELECT  1
                FROM    "20090715_or".tbl_d d
                WHERE   d.id2 = a.id2
                        AND d.id3 = 10
                )
                OR
                EXISTS
                (
                SELECT  1
                FROM    "20090715_or".tbl_d d
                WHERE   d.id2 = a.id2
                        AND d.id3 = 10
                )
        )

NO	ID1	ID2
No 1	1	101
1 row fetched in 0.0002s (0.0018s)

SELECT STATEMENT 
 FILTER 
  TABLE ACCESS FULL, 20090715_or.TBL_A
  INDEX RANGE SCAN, 20090715_or.IX_B_1
   INDEX RANGE SCAN, 20090715_or.IX_B_2
    INDEX RANGE SCAN, 20090715_or.IX_C_1FLAG
     INDEX RANGE SCAN, 20090715_or.IX_C_2FLAG
      INDEX RANGE SCAN, 20090715_or.IX_D_23
       INDEX RANGE SCAN, 20090715_or.IX_D_23

18 ms, which is literally nothing.

This solution can be applied to the first query as well, and probably will be as efficient in the first case too.

I just wanted to demonstrate that transforming an OR query to a UNION of AND queries can help the optimizer to choose right joining scenario for each separate query and boost overall query performance.

Written by Quassnoi

July 15th, 2009 at 11:00 pm

Posted in Oracle

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

EXPLAIN EXTENDED

Oracle: OR on multiple EXISTS clauses

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow