EXPLAIN EXTENDED

How to create fast database queries

Genealogy: finding a descendant

Comments enabled. I *really* need your comment

In the previous article I described an efficient way to build a genealogy with a hierarchical query in Oracle:

Now, I'd like to describe a query to check if one entity in the table is a descendant on another one.

Let's create a sample table:

CREATE TABLE animal (
        animal_id INTEGER NOT NULL,
        father INTEGER NOT NULL,
        mother INTEGER NOT NULL,
        name VARCHAR2(50) NOT NULL
)
/
ALTER TABLE animal
ADD CONSTRAINT pk_animal_id
PRIMARY KEY (animal_id)
/
CREATE INDEX ix_animal_father
ON animal (father)
/
CREATE INDEX ix_animal_mother
ON animal (mother)
/
CALL DBMS_RANDOM.SEED(20090525)
/
INSERT
INTO    animal (animal_id, father, mother, name)
SELECT  lvl,
        CASE WHEN father <= 0 OR father >= lvl THEN 0 ELSE father END,
        CASE WHEN mother <= 0 OR mother >= lvl THEN 0 ELSE mother END,
        'Animal ' || lvl
FROM    (
        SELECT  level AS lvl,
                level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 2 AS father,
                level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 1 AS mother
        FROM    dual
        CONNECT BY
                level <= 10000
        )
/
COMMIT
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090525_genealogy"');
END;
/
&#91;/sourcecode&#93;

As in previous article, this table has <strong>10,000</strong> rows, and for each animal both parents are randomly chosen from <strong>100</strong> previous ones.

Male animals have even <code>id</code>'s, female ones have odd ones.

We had a query to select all descendants for a given animal. To find out if an animal with a given <code>animal_id</code> is a descendant, we can just add a simple <code>WHERE</code> filter:


SELECT  DISTINCT(animal_id)
FROM    (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
WHERE   animal_id = 9999
START WITH
        animal_id = 9000
CONNECT BY
        parent = PRIOR animal_id

ANIMAL_ID
9999
1 row fetched in 0.0001s (9.7809s)
SELECT STATEMENT 
 HASH UNIQUE
  FILTER 
   CONNECT BY WITH FILTERING
    FILTER 
     COUNT 
      VIEW 
       UNION-ALL 
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    HASH JOIN 
     CONNECT BY PUMP 
     COUNT 
      VIEW 
       UNION-ALL 
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
        TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    COUNT 
     VIEW 
      UNION-ALL 
       TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
       TABLE ACCESS FULL, 20090525_genealogy.ANIMAL

This query works fine for younger animals, but its time grows exponentially with the animal's age.

In the example above it takes 12 seconds, and for a animal of the first 100 rows it would take like forever.

This task in fact is path finding problem. There are efficient algorithms to do that (like Dijkstra's algorithm), but they are hardly feasible to implement with SQL.

For instance, the query above just goes over all descendancy paths, selects all animals it meets on its way and then uses HASH UNIQUE to get rid of the duplicates, which, of course, takes exponential time.

But it's possible to improve this query.

Here are four steps to do this:

  1. Constructing a UNION ALL each time by browsing the original table is quite expensive. It's better to materialize the table so that only required fields are stored.
  2. To find out if a given animal is a descendant of another one, it's not needed to go over all descendants. It's enough to find the first path leading to this descendant and stop. This will reduce the query time greatly.
  3. We need to cut the paths that lead to the animals younger that the one we are searching for: they cannot be its parents.
  4. We need to select longer paths first, as they are more likely to lead to the descendant sooner.

Fortunately, all these steps are possible to implement in Oracle:

WITH parents AS
(
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
SELECT *
FROM (
SELECT p.*, level
FROM parents p
WHERE p.animal_id = 9500
START WITH
p.animal_id = 20
CONNECT BY
p.parent = PRIOR p.animal_id
AND p.parent < 9500 ORDER SIBLINGS BY animal_id - parent DESC ) WHERE rownum = 1 [/sourcecode]

GENDER ANIMAL_ID PARENT LEVEL
1 9500 9463 152
1 row fetched in 0.0002s (0.6094s)
SELECT STATEMENT 
 TEMP TABLE TRANSFORMATION 
  LOAD AS SELECT 
   UNION-ALL 
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
  COUNT STOPKEY
   VIEW 
    FILTER 
     CONNECT BY WITH FILTERING
      FILTER 
       COUNT 
        VIEW 
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986
      HASH JOIN 
       CONNECT BY PUMP 
       COUNT 
        VIEW 
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986
      COUNT 
       VIEW 
        TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986

This query runs for only 0.6 seconds, and finds a very long ancestry chain between a very old and very young animal.

This needs a little explanation:

  1. To materialize the query, we used a WITH clause.

    Usually, materializing requires an undocumented hint MATERIALIZE, but for hierarchical queries the CBO is smart enough to materialize the query without this hint.

  2. To stop on the first path, we used the subquery along with ROWNUM = 1 which gave use a COUNT STOPKEY in the plan and stopped scanning as soon as it found the first path.
  3. To cut the wrong paths, we added the condition AND p.parent < 9500 to the CONNECT BY clause. This conditions stops traversing the hierarchy as soon as it reaches the maximal possible age for the parent (a parent cannot be younger than a child).
  4. Finally, we added ORDER SIBLINGS BY animal_id - parent DESC. This condition chooses order in which the siblings are chosen for traversal (and therefore returned). Of all possible sibings, the engine selects the ones born latest, thus making the engine to evaluate the shorter paths first, which reduces the number of iterations.

We can even improve this query to find out if several animals are descendants of a given one:

WITH parents AS
(
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
SELECT *
FROM animal a
WHERE animal_id BETWEEN 9500 AND 9510
AND EXISTS
(
SELECT 1
FROM parents p
WHERE p.animal_id = a.animal_id
START WITH
p.animal_id = 20
CONNECT BY
p.parent = PRIOR p.animal_id
AND p.parent < a.animal_id ) [/sourcecode]

ANIMAL_ID FATHER MOTHER NAME
9500 9400 9463 Animal 9500
9501 9472 9415 Animal 9501
9502 9472 9481 Animal 9502
9503 9476 9481 Animal 9503
9504 9422 9451 Animal 9504
9505 9472 9499 Animal 9505
9506 9432 9447 Animal 9506
9507 9468 9477 Animal 9507
9508 9448 9501 Animal 9508
9509 9410 9487 Animal 9509
9510 9442 9473 Animal 9510
11 rows fetched in 0.0005s (6.0466s)
SELECT STATEMENT 
 TEMP TABLE TRANSFORMATION 
  LOAD AS SELECT , 20090525_genealogy.ANIMAL
   UNION-ALL 
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
    TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
  TABLE ACCESS BY INDEX ROWID, 20090525_genealogy.ANIMAL
   INDEX RANGE SCAN, 20090525_genealogy.PK_ANIMAL_ID
    FILTER 
     CONNECT BY WITH FILTERING
      FILTER 
       COUNT 
        VIEW 
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986
      HASH JOIN 
       CONNECT BY PUMP 
       COUNT 
        VIEW 
         TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986
      COUNT 
       VIEW 
        TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986

, which also works quite fast (6 seconds).

In this query, however, we had to omit last step (ORDER SIBLINGS BY), as Oracle does not allow ORDER BY clause in correlated subqueries.

The query is still quite efficient.

Written by Quassnoi

May 25th, 2009 at 11:00 pm

Posted in Oracle

Leave a Reply