EXPLAIN EXTENDED

How to create fast database queries

Genealogy query on both parents

Comments enabled. I *really* need your comment

From Stack Overflow:

I'm trying to fetch a genealogy tree of animals from my Oracle database.

Here's the table:

Animal_ID Parent_Male_ID Parent_Female_ID

If I specify an animal, I can get all of its descendants (on the male side) using something like this:

SELECT  *
FROM    animal
START WITH
animal_id = 123
CONNECT BY
PRIOR animal_id = parent_male_id

I'm trying to find a way to extend this in such a way that if I specify an animal, it will fetch both parents and then will fetch all of their descendants.

Any thoughts?

This is possible to do with a same CONNECT BY query by providing a little more comlex condition for START WITH and CONNECT BY:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    animal
START WITH
animal_id IN
(
SELECT  father
FROM    animal
WHERE   animal_id = 9500
UNION ALL
SELECT  mother
FROM    animal
WHERE   animal_id = 9500
)
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id

However, my gut feeling says this query need performance checking.

Let's create a sample table:

CREATE TABLE animal (
animal_id INTEGER NOT NULL,
father INTEGER NOT NULL,
mother INTEGER NOT NULL,
name VARCHAR2(50) NOT NULL
)
/
ALTER TABLE animal
ADD CONSTRAINT pk_animal_id
PRIMARY KEY (animal_id)
/
CREATE INDEX ix_animal_father
ON animal (father)
/
CREATE INDEX ix_animal_mother
ON animal (mother)
/
CALL DBMS_RANDOM.SEED(20090524)
/
INSERT
INTO    animal (animal_id, father, mother, name)
SELECT  lvl,
CASE WHEN father <= 0 OR father >= lvl THEN 0 ELSE father END,
CASE WHEN mother <= 0 OR mother >= lvl THEN 0 ELSE mother END,
'Animal ' || lvl
FROM    (
SELECT  level AS lvl,
level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 2 AS father,
level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 1 AS mother
FROM    dual
CONNECT BY
level <= 10000
        )
/
COMMIT
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090524_genealogy"');
END;
/

This table is filled with 10,000 rows.

Males have even id's, females have odd ones.

For each animal, both parents are randomly chosen from 100 preceding species (taking gender into account, of course).

Both father and mother fields are NOT NULL, a fake id of zero is used to mark unknown parent (so that the indexes can be always used).

Such an algorithm, of course, makes incest and promiscuity common for these animals, so that multiple ancestry chains can lead from one animal to another.

Now, let's run the query:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    animal
START WITH
animal_id IN
(
SELECT  father
FROM    animal
WHERE   animal_id = 9500
UNION ALL
SELECT  mother
FROM    animal
WHERE   animal_id = 9500
)
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id
ANIMAL_ID
9409
9432
9435
9439
9453
9462
9465
9473
9493
9500
9998
9999
10000
424 rows fetched in 0.0156s (98.0431s)
SELECT STATEMENT 
 SORT UNIQUE
  CONNECT BY WITHOUT FILTERING
   TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
    NESTED LOOPS 
     VIEW , SYS.VW_NSO_1
      UNION-ALL 
       TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
        INDEX UNIQUE SCAN, 20090524_genealogy.PK_ANIMAL_ID
       TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
        INDEX UNIQUE SCAN, 20090524_genealogy.PK_ANIMAL_ID
     INDEX UNIQUE SCAN, 20090524_genealogy.PK_ANIMAL_ID
   TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
   UNION-ALL 
    TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
     INDEX RANGE SCAN, 20090524_genealogy.IX_ANIMAL_FATHER
    TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
     INDEX RANGE SCAN, 20090524_genealogy.IX_ANIMAL_MOTHER

This query works for 98 seconds, and this is for animal with id of 9500, that is one of the youngest ones.

For elder animals, this query will take literally forever.

But why so long?

This query does not look very complex. It seems that on each iteration there should be an index lookup which finds a appropriate set of animals that have PRIOR animal_id either as a father or as a mother and further traverses the tree.

However, Oracle is not so smart.

When building the query plan, it need to choose an index it will then use for joining.

As any given animal can be either male or female, the different index will be used in each case. In case of a male species, the engine needs to use index on father to select all of its breed, and the index on mother should be used to select all children of a female species.

More than that: the index should be chosen again on each iteration.

Oracle can use a single index to build a hierarchial query, but such a complex task is beyond CBO's abilities.

That's why the CBO choses a good old TABLE ACCESS FULL to traverse the hierarchy tree, though it manages to pick up the indexes to select the animal_id for the START WITH condition.

Now, what can we do with this?

Since Oracle can only use a single index, we need to provide it somehow with this single index.

This is impossible to do on the table layout like this, since there can be at most one ROWID per index. Oracle does not allow creation of indexes with multiple key values for one row.

Oracle also does not allow creating indexes on non-materialized views.

However, we can trick Oracle a little and force it to create the index on the fly.

If we look into the query plan above, we will see that NESTED LOOPS along with TABLE ACCESS FULL are used to iterate the hierarchy, though a hash lookup would be more efficient here.

But hash lookups are possible only on equi-joins, and we have an IN condition. Can we turn it to an equi-join somehow?

Since we need an equality condition here, we need two separate rows to be returned for each condition in the IN clause. This can be achieved by selecting father's and mother's apart and UNION'ing them.

For each animal we will get two rows: first one with its father as a parent, second one with its mother. Then we can just use the common condition parent = PRIOR id.

On each iteration step, two rows will satisfy it now instead of one, which will lead to more iterations.

But since a HASH JOIN will be used now, this will perform much better.

Let's check it out:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    (
SELECT  0 AS gender, animal_id, father AS parent
FROM    animal
UNION ALL
SELECT  1, animal_id, mother
FROM    animal
)
START WITH
animal_id IN
(
SELECT  father
FROM    animal
WHERE   animal_id = 9500
UNION ALL
SELECT  mother
FROM    animal
WHERE   animal_id = 9500
)
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id

ANIMAL_ID
9409
9432
9435
9439
9453
9462
9465
9473
9493
9500
9998
9999
10000
424 rows fetched in 0.0077s (0.4217s)
SELECT STATEMENT 
 SORT UNIQUE
  CONNECT BY WITH FILTERING
   FILTER 
    COUNT 
     VIEW 
      UNION-ALL 
       TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
       TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
    UNION-ALL 
     TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
      INDEX RANGE SCAN, 20090524_genealogy.IX_ANIMAL_FATHER
     TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
      INDEX RANGE SCAN, 20090524_genealogy.IX_ANIMAL_MOTHER
   HASH JOIN 
    CONNECT BY PUMP 
    COUNT 
     VIEW 
      UNION-ALL 
       TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
       TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
   COUNT 
    VIEW 
     UNION-ALL 
      TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
      TABLE ACCESS FULL, 20090524_genealogy.ANIMAL
   UNION-ALL 
    TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
     INDEX RANGE SCAN, 20090524_genealogy.IX_ANIMAL_FATHER
    TABLE ACCESS BY INDEX ROWID, 20090524_genealogy.ANIMAL
     INDEX RANGE SCAN, 20090524_genealogy.IX_ANIMAL_MOTHER

Now the query takes only 0.42 seconds.

Written by Quassnoi

May 24th, 2009 at 11:00 pm

Posted in Oracle

Leave a Reply