EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Genealogy: finding a descendant

In the previous article I described an efficient way to build a genealogy with a hierarchical query in Oracle:

Now, I’d like to describe a query to check if one entity in the table is a descendant on another one.

Let’s create a sample table:

CREATE TABLE animal (
        animal_id INTEGER NOT NULL,
        father INTEGER NOT NULL,
        mother INTEGER NOT NULL,
        name VARCHAR2(50) NOT NULL
)
/
ALTER TABLE animal
ADD CONSTRAINT pk_animal_id
PRIMARY KEY (animal_id)
/
CREATE INDEX ix_animal_father
ON animal (father)
/
CREATE INDEX ix_animal_mother
ON animal (mother)
/
CALL DBMS_RANDOM.SEED(20090525)
/
INSERT
INTO    animal (animal_id, father, mother, name)
SELECT  lvl,
        CASE WHEN father <= 0 OR father >= lvl THEN 0 ELSE father END,
        CASE WHEN mother <= 0 OR mother >= lvl THEN 0 ELSE mother END,
        'Animal ' || lvl
FROM    (
        SELECT  level AS lvl,
                level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 2 AS father,
                level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 1 AS mother
        FROM    dual
        CONNECT BY
                level <= 10000
        )
/
COMMIT
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090525_genealogy"');
END;
/

As in previous article, this table has 10,000 rows, and for each animal both parents are randomly chosen from 100 previous ones.

Male animals have even id‘s, female ones have odd ones.

We had a query to select all descendants for a given animal. To find out if an animal with a given animal_id is a descendant, we can just add a simple WHERE filter:

SELECT  DISTINCT(animal_id)
FROM    (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
WHERE   animal_id = 9999
START WITH
        animal_id = 9000
CONNECT BY
        parent = PRIOR animal_id
ANIMAL_ID
9999
1 row fetched in 0.0001s (9.7809s)
SELECT STATEMENT
 HASH UNIQUE
  FILTER
   CONNECT BY WITH FILTERING
    FILTER
     COUNT
      VIEW
       UNION-ALL
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    HASH JOIN
     CONNECT BY PUMP
     COUNT
      VIEW
       UNION-ALL
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    COUNT
     VIEW
      UNION-ALL
       TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
       TABLE ACCESS FULL, 20090525_genealogy.ANIMAL

This query works fine for younger animals, but its time grows exponentially with the animal’s age.

In the example above it takes 12 seconds, and for a animal of the first 100 rows it would take like forever.

This task in fact is path finding problem. There are efficient algorithms to do that (like Dijkstra’s algorithm), but they are hardly feasible to implement with SQL.

For instance, the query above just goes over all descendancy paths, selects all animals it meets on its way and then uses HASH UNIQUE to get rid of the duplicates, which, of course, takes exponential time.

But it’s possible to improve this query.

Here are four steps to do this:

  1. Constructing a UNION ALL each time by browsing the original table is quite expensive. It’s better to materialize the table so that only required fields are stored.
  2. To find out if a given animal is a descendant of another one, it’s not needed to go over all descendants. It’s enough to find the first path leading to this descendant and stop. This will reduce the query time greatly.
  3. We need to cut the paths that lead to the animals younger that the one we are searching for: they cannot be its parents.
  4. We need to select longer paths first, as they are more likely to lead to the descendant sooner.

Fortunately, all these steps are possible to implement in Oracle:

WITH parents AS
        (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
SELECT  *
FROM    (
        SELECT  p.*, level
        FROM    parents p
        WHERE   p.animal_id = 9500
        START WITH
                p.animal_id = 20
        CONNECT BY
                p.parent = PRIOR p.animal_id
                AND p.parent < 9500
        ORDER SIBLINGS BY
                animal_id - parent DESC
        )
WHERE   rownum = 1
GENDER ANIMAL_ID PARENT LEVEL
1 9500 9463 152
1 row fetched in 0.0002s (0.6094s)
SELECT STATEMENT
 TEMP TABLE TRANSFORMATION
  LOAD AS SELECT
   UNION-ALL
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
  COUNT STOPKEY
   VIEW
    FILTER
     CONNECT BY WITH FILTERING
      FILTER
       COUNT
        VIEW
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986
      HASH JOIN
       CONNECT BY PUMP
       COUNT
        VIEW
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986
      COUNT
       VIEW
        TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986

This query runs for only 0.6 seconds, and finds a very long ancestry chain between a very old and very young animal.

This needs a little explanation:

  1. To materialize the query, we used a WITH clause.

    Usually, materializing requires an undocumented hint MATERIALIZE, but for hierarchical queries the CBO is smart enough to materialize the query without this hint.

  2. To stop on the first path, we used the subquery along with ROWNUM = 1 which gave use a COUNT STOPKEY in the plan and stopped scanning as soon as it found the first path.
  3. To cut the wrong paths, we added the condition AND p.parent < 9500 to the CONNECT BY clause. This conditions stops traversing the hierarchy as soon as it reaches the maximal possible age for the parent (a parent cannot be younger than a child).
  4. Finally, we added ORDER SIBLINGS BY animal_id - parent DESC. This condition chooses order in which the siblings are chosen for traversal (and therefore returned). Of all possible sibings, the engine selects the ones born latest, thus making the engine to evaluate the shorter paths first, which reduces the number of iterations.

We can even improve this query to find out if several animals are descendants of a given one:

WITH parents AS
        (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
SELECT  *
FROM    animal a
WHERE   animal_id BETWEEN 9500 AND 9510
        AND EXISTS
        (
        SELECT  1
        FROM    parents p
        WHERE   p.animal_id = a.animal_id
        START WITH
                p.animal_id = 20
        CONNECT BY
                p.parent = PRIOR p.animal_id
                AND p.parent < a.animal_id
        )
ANIMAL_ID FATHER MOTHER NAME
9500 9400 9463 Animal 9500
9501 9472 9415 Animal 9501
9502 9472 9481 Animal 9502
9503 9476 9481 Animal 9503
9504 9422 9451 Animal 9504
9505 9472 9499 Animal 9505
9506 9432 9447 Animal 9506
9507 9468 9477 Animal 9507
9508 9448 9501 Animal 9508
9509 9410 9487 Animal 9509
9510 9442 9473 Animal 9510
11 rows fetched in 0.0005s (6.0466s)
SELECT STATEMENT
 TEMP TABLE TRANSFORMATION
  LOAD AS SELECT , 20090525_genealogy.ANIMAL
   UNION-ALL
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
  TABLE ACCESS BY INDEX ROWID, 20090525_genealogy.ANIMAL
   INDEX RANGE SCAN, 20090525_genealogy.PK_ANIMAL_ID
    FILTER
     CONNECT BY WITH FILTERING
      FILTER
       COUNT
        VIEW
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986
      HASH JOIN
       CONNECT BY PUMP
       COUNT
        VIEW
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986
      COUNT
       VIEW
        TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986

, which also works quite fast (6 seconds).

In this query, however, we had to omit last step (ORDER SIBLINGS BY), as Oracle does not allow ORDER BY clause in correlated subqueries.

The query is still quite efficient.

Written by Quassnoi

May 25th, 2009 at 11:00 pm

Posted in Oracle

Comments are closed.