EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: 5 Claims About SQL, Explained. You're welcome to read and comment on it.

Adjacency list vs. nested sets: Oracle

Comments enabled. I *really* need your comment

Continuing the series:

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

For detailed explanations of the terms, see the first article in the series:

Today is Oracle time.

Adjacent sets require recursion, and Oracle has a native way to implement recursion in SQL statements. To do this it uses a special clause, CONNECT BY.

Unlike other systems, Oracle always supported recursion in SQL statements since version 2 (this is the name the first version of Oracle was released under).

This was done because by that time there already were several hierarchical databases on the market which are quite good for storing hierarchical data. If not for this clause, transition from a hierarchical DBMS to the relational DBMS would be too hard.

Oracle's way to query self-relations is somewhat different from recursive CTE's that other systems use:

  • Recursive CTEs can use an arbitrary set in recursive steps. The recursion base and recursion stack are not visible in further steps. The recursive operation is a set operation (usually a JOIN) on the recursion parameter (a set) and the result of the previous operation:

    Recursive CTE

  • CONNECT BY queries can only use the value of a single row in recursive steps.

    The recursion base and recursion stack are visible, the values outside the recursion stack are not.

    The recursive operation is a set operation on each row of the current row and the base rowset.

    Unlike recursive CTE, CONNECT BY operates row-wise, that is each row of the result spans its own recursion branch with its own stack:

    CONNECT BY

The main difference it that the CONNECT BY operation cannot produce anything that was not in the original rowset, while a recursive CTE can.

However, CONNECT BY has numerous benefits too: it's very fast, allows to track the hierarchy path easily (using a built-in funtion named SYS_CONNECT_BY_PATH) and implicitly sorts the recordset in tree order (with additional clause ORDER SIBLINGS BY to sort the siblings).

This method, of course, allows traversing adjacency lists easily.

To compare efficiency of adjacency lists and that of nested sets, let's create a sample table:

Table creation details

The table contains both adjacency list data and nested sets data, with 8 levels of nesting, 5 children of each parent node and 2,441,405 records.

As in the previous articles, we'll test performance of the three most used queries:

  • Find all descendants of a given node
  • Find all ancestors of a given node
  • Find all descendants of a given node up to a certain depth

All descendants

Nested sets

SELECT  SUM(LENGTH(hc.stuffing))
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = 42

View query details

This is super fast, faster than in any system before that: only 10 ms. The plan is quite predictable: a UNIQUE scan to find the parent record and a range scan to find all children.

Adjacency list

SELECT  SUM(LENGTH(stuffing))
FROM    t_hierarchy
START WITH
        id = 42
CONNECT BY
        parent = PRIOR id

View query details

This is quite fast too: 140 ms. The plan loops over the parent ranges for each id.

Note the TABLE ACCESS FULL in the end of the plan. Oracle reserves a right to do a FULL SCAN instead of a RANGE SCAN if it considers a certain parent range too large to use the index. However, this is recursion, not a single operation, and it's impossible to tell in advance which id values will be returned on each recursion step. That's why Oracle shows both methods, just in case, and chooses the most efficient method in runtime.

In our table, at most 5 rows share a parent (of more then 2 million), so it's very unlikely this plan branch will be ever chosen.

Find all ancestors

Nested sets

SELECT  hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM    t_hierarchy hc
JOIN    t_hierarchy hp
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hc.id = 1000000
ORDER BY
        hp.lft

View query details

The plan used is exactly the same as for the query to choose all descendants.

However, the range on lft is too broad and the query works for more than 3 seconds.

Adjacency list

SELECT  id, parent, lft, rgt, data, level
FROM    t_hierarchy
START WITH
        id = 1000000
CONNECT BY
        id = PRIOR parent
ORDER BY
        level DESC

View query details

This query completes in less than 1 ms, which is instant.

Note that unlike recursive CTE's, Oracle supplies a built-in preudocolumn, level, that returns the current recursion level. This comes handy in cases like this.

Descendants up to a given level

Nested sets

SELECT  hc.id, hc.parent, hc.lft, hc.rgt, hc.data
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = ?
        AND
        (
        SELECT  COUNT(*)
        FROM    t_hierarchy hn
        WHERE   hc.lft BETWEEN hn.lft AND hn.rgt
                AND hn.lft BETWEEN hp.lft AND hp.rgt
        ) <= 3

View the query for node 42

View the query for node 31,415

As in other systems, performance heavily depends on how many descendants does the record have.

For the node 42 which is close to the root node and has lots of descendants, the query runs for 122 seconds. For the node 31,415, the query takes 1.5 ms (next to instant).

Adjacency list

SELECT  id, parent, lft, rgt, data
FROM    t_hierarchy
START WITH
        id = ?
CONNECT BY
        parent = PRIOR id
        AND level <= 3

Limiting the recursion depth query is very simple in Oracle: we just add the additional condition to CONNECT BY clause constraining the limit.

View the query for node 42

View the query for node 31,415

Both queries complete in 1 ms, that is instant.

Summary

Oracle implements a special clause, CONNECT BY, that makes traversing adjacency lists in general and parent-child relationships in particular very easy, since this is exactly what this clause was intended for.

Except for finding all descendants of a single node (without finding out or limiting the depth), nested sets model is less efficient in Oracle than adjacency lists.

Adjacency lists are not only more efficient but the queries using them are more elegant and easy to maintain due to the native support.

That's why adjacency list model should be used instead of nested sets model in Oracle too, as well as in SQL Server and PostgreSQL 8.4.

Written by Quassnoi

September 28th, 2009 at 11:00 pm

Posted in Oracle

Leave a Reply