Genealogy: finding a descendant
Comments enabled. I *really* need your comment
In the previous article I described an efficient way to build a genealogy with a hierarchical query in Oracle:
Now, I'd like to describe a query to check if one entity in the table is a descendant on another one.
Let's create a sample table:
CREATE TABLE animal ( animal_id INTEGER NOT NULL, father INTEGER NOT NULL, mother INTEGER NOT NULL, name VARCHAR2(50) NOT NULL ) / ALTER TABLE animal ADD CONSTRAINT pk_animal_id PRIMARY KEY (animal_id) / CREATE INDEX ix_animal_father ON animal (father) / CREATE INDEX ix_animal_mother ON animal (mother) / CALL DBMS_RANDOM.SEED(20090525) / INSERT INTO animal (animal_id, father, mother, name) SELECT lvl, CASE WHEN father <= 0 OR father >= lvl THEN 0 ELSE father END, CASE WHEN mother <= 0 OR mother >= lvl THEN 0 ELSE mother END, 'Animal ' || lvl FROM ( SELECT level AS lvl, level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 2 AS father, level - MOD(level, 2) - FLOOR(DBMS_RANDOM.VALUE() * 50) * 2 - 1 AS mother FROM dual CONNECT BY level <= 10000 ) / COMMIT / BEGIN DBMS_STATS.gather_schema_stats('"20090525_genealogy"'); END; / [/sourcecode] As in previous article, this table has <strong>10,000</strong> rows, and for each animal both parents are randomly chosen from <strong>100</strong> previous ones. Male animals have even <code>id</code>'s, female ones have odd ones. We had a query to select all descendants for a given animal. To find out if an animal with a given <code>animal_id</code> is a descendant, we can just add a simple <code>WHERE</code> filter: SELECT DISTINCT(animal_id) FROM ( SELECT 0 AS gender, animal_id, father AS parent FROM animal UNION ALL SELECT 1, animal_id, mother FROM animal ) WHERE animal_id = 9999 START WITH animal_id = 9000 CONNECT BY parent = PRIOR animal_id
ANIMAL_ID | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9999 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (9.7809s) |
SELECT STATEMENT HASH UNIQUE FILTER CONNECT BY WITH FILTERING FILTER COUNT VIEW UNION-ALL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL HASH JOIN CONNECT BY PUMP COUNT VIEW UNION-ALL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL COUNT VIEW UNION-ALL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL
This query works fine for younger animals, but its time grows exponentially with the animal's age.
In the example above it takes 12 seconds, and for a animal of the first 100 rows it would take like forever.
This task in fact is path finding problem. There are efficient algorithms to do that (like Dijkstra's algorithm), but they are hardly feasible to implement with SQL
.
For instance, the query above just goes over all descendancy paths, selects all animals it meets on its way and then uses HASH UNIQUE
to get rid of the duplicates, which, of course, takes exponential time.
But it's possible to improve this query.
Here are four steps to do this:
- Constructing a
UNION ALL
each time by browsing the original table is quite expensive. It's better to materialize the table so that only required fields are stored. - To find out if a given animal is a descendant of another one, it's not needed to go over all descendants. It's enough to find the first path leading to this descendant and stop. This will reduce the query time greatly.
- We need to cut the paths that lead to the animals younger that the one we are searching for: they cannot be its parents.
- We need to select longer paths first, as they are more likely to lead to the descendant sooner.
Fortunately, all these steps are possible to implement in Oracle:
WITH parents AS
(
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
SELECT *
FROM (
SELECT p.*, level
FROM parents p
WHERE p.animal_id = 9500
START WITH
p.animal_id = 20
CONNECT BY
p.parent = PRIOR p.animal_id
AND p.parent < 9500
ORDER SIBLINGS BY
animal_id - parent DESC
)
WHERE rownum = 1
[/sourcecode]
GENDER | ANIMAL_ID | PARENT | LEVEL | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 9500 | 9463 | 152 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.6094s) |
SELECT STATEMENT TEMP TABLE TRANSFORMATION LOAD AS SELECT UNION-ALL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL COUNT STOPKEY VIEW FILTER CONNECT BY WITH FILTERING FILTER COUNT VIEW TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986 HASH JOIN CONNECT BY PUMP COUNT VIEW TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986 COUNT VIEW TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6668_7C7986
This query runs for only 0.6 seconds, and finds a very long ancestry chain between a very old and very young animal.
This needs a little explanation:
- To materialize the query, we used a
WITH
clause.Usually, materializing requires an undocumented hint
MATERIALIZE
, but for hierarchical queries the CBO is smart enough to materialize the query without this hint. - To stop on the first path, we used the subquery along with ROWNUM = 1 which gave use a
COUNT STOPKEY
in the plan and stopped scanning as soon as it found the first path. - To cut the wrong paths, we added the condition
AND p.parent < 9500
to theCONNECT BY
clause. This conditions stops traversing the hierarchy as soon as it reaches the maximal possible age for the parent (a parent cannot be younger than a child). - Finally, we added
ORDER SIBLINGS BY animal_id - parent DESC
. This condition chooses order in which the siblings are chosen for traversal (and therefore returned). Of all possible sibings, the engine selects the ones born latest, thus making the engine to evaluate the shorter paths first, which reduces the number of iterations.
We can even improve this query to find out if several animals are descendants of a given one:
WITH parents AS
(
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
SELECT *
FROM animal a
WHERE animal_id BETWEEN 9500 AND 9510
AND EXISTS
(
SELECT 1
FROM parents p
WHERE p.animal_id = a.animal_id
START WITH
p.animal_id = 20
CONNECT BY
p.parent = PRIOR p.animal_id
AND p.parent < a.animal_id
)
[/sourcecode]
ANIMAL_ID | FATHER | MOTHER | NAME | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9500 | 9400 | 9463 | Animal 9500 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9501 | 9472 | 9415 | Animal 9501 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9502 | 9472 | 9481 | Animal 9502 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9503 | 9476 | 9481 | Animal 9503 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9504 | 9422 | 9451 | Animal 9504 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9505 | 9472 | 9499 | Animal 9505 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9506 | 9432 | 9447 | Animal 9506 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9507 | 9468 | 9477 | Animal 9507 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9508 | 9448 | 9501 | Animal 9508 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9509 | 9410 | 9487 | Animal 9509 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9510 | 9442 | 9473 | Animal 9510 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
11 rows fetched in 0.0005s (6.0466s) |
SELECT STATEMENT TEMP TABLE TRANSFORMATION LOAD AS SELECT , 20090525_genealogy.ANIMAL UNION-ALL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL TABLE ACCESS FULL, 20090525_genealogy.ANIMAL TABLE ACCESS BY INDEX ROWID, 20090525_genealogy.ANIMAL INDEX RANGE SCAN, 20090525_genealogy.PK_ANIMAL_ID FILTER CONNECT BY WITH FILTERING FILTER COUNT VIEW TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986 HASH JOIN CONNECT BY PUMP COUNT VIEW TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986 COUNT VIEW TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6677_7C7986
, which also works quite fast (6 seconds).
In this query, however, we had to omit last step (ORDER SIBLINGS BY
), as Oracle does not allow ORDER BY
clause in correlated subqueries.
The query is still quite efficient.