EXPLAIN EXTENDED

How to create fast database queries

Adjacency list vs. nested sets: PostgreSQL

with 4 comments

This series of articles is inspired by numerous questions asked on the site and on Stack Overflow.

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

First, let's explain what all this means.

Adjacency list

Hierarchical relations (not to be confused with hierarchical data model) are 0-1:0-N transitive relations between entities of same domain.

For instance, ancestor-descendant relation is:

  • Transitive:
    • If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C
  • Irreflexive:
    • If A is an ancestor of B, then B is never an ancestor of A
  • 0-1:0-N
    • A can have zero, one or many children. A can have zero or one parents.

These relations can be represented by an ordered directed tree.

Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.

Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.

This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that's why this representation is called adjacency list.

Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model and adjacency list model have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.

Now, since we have a tree here which implies 0-1:0-N relationship between the vertices, we can define the relation as a self-relation: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a FOREIGN KEY reference to the table itself.

Since multple items can have no parents (and therefore be the roots of their trees), it's sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.

This is a nice and elegant model, but until recently it had one drawback: it could not be used with SQL.

SQL, as we all know, deals with relational tables which can be transformed by the means of relational algebra.

It provides a way to do number of operations including relational multiplication, projection, sum etc.

However, earlier versions or SQL lacked recursion which is required to do certain operations efficiently. Namely, recursion is required to operate upon the adjacency list structure.

The most common operations are:

  • Find all descendants of a given node
  • Find all ancestors of a given node
  • Find all descendants of a given node up to a certain depth

The first two operations require recursion.

The third one does too if the depth level should serve as a parameter.

To work around this, the nested sets model was proposed by Michael Kamfonas and popularized by Joe Celko.

Nested sets

The idea of nested sets is quite simple and can be illustrated using one of the most popular ways to store hierarchical data, the XML.

How would we store the hierarchies in XML?

We would just use one tag to describe every item and nest it accordingly, like this:

<item id="0">
 <item id="1">
  <item id="2"/>
  <item id="3">
   <item id="4"/>
  </item>
  <item id="5">
   <item id="6">
    <item id="7"/>
   </item>
  </item>
 </item>
 <item id="8"/>
</item>

This is fine, but how to use this structure in a relational table?

As you can see, opening and closing tags of each node are contained on their own lines here. The opening and closing tags may (or may not) be the same for the nodes containing no children (this is not important).

We see that the node ranges never intersect: given two node ranges, their intersection always makes the range of either of the nodes or an empty set.

In other words, a node open within another node should also close within it.

If and only if this holds for each and every node, the nodes make a valid XML file and a valid hierarchy.

Now, the nested set model can be described in one sentence:

To store a hierarchy in a nested set model, we just store the line numbers of the opening and closing tags of each node as if it were an XML file.

Each set of nodes having a common ancestor is nested within the node of this ancestor. That's why this model is called nested sets.

Historically, these line numbers are stored in columns named lft and rgt (since LEFT and RIGHT are reserved words in most SQL dialects).

That's how this hierarchy would look in a nested sets model:

ID LFT RGT
0 1 14
1 2 12
2 3 3
3 4 6
4 5 5
5 7 11
6 8 10
7 9 9
8 13 13

This model is more SQL friendly, since the tasks described above can be performed in SQL without using recursion.

To find out all ancestors of a given node, we just select all nodes that contain its LFT boundary (which in a properly built hierarchy implies containing the RGT boundary too):

SELECT  hp.*
FROM    t_hierarchy hc
JOIN    t_hierarchy hp
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hc.id = ?

And to find the descendants, we shoud just reverse the condition, i. e. find all nodes that are contained between the current node's boundaries:

SELECT  hc.*
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = ?

Ironically, selecting all descendants up to a given depth (which is the least problem for the adjacency list as long as the depth is known in design time) is the hardest task for the nested set model. Even obtaning the list of immediate parents and immediate children is not so simple.

However, this is solvable. This is the query to get all descendants up to the third generation (that is node itself, all children and grandchildren):

SELECT  hc.*
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = ?
        AND
        (
        SELECT  COUNT(*)
        FROM    t_hierarchy hn
        WHERE   hc.lft BETWEEN hn.lft AND hn.rgt
                AND hn.lft BETWEEN hp.lft AND hp.rgt
        ) <= 3

Unlike adjacency list model, the depth level can be parametrized in this query which makes it possible to use a single query for all depth level.

Nested set model can be relatively easily queried for, but it's extremely hard to manage.

To insert a new child into a node or make an existing mode a child in adjacency list, everything we need is provide the new value of its parent column. With a single update we can move a whole branch.

To add a new node into a nested set model we should do exactly the same as if we were adding a new node into an XML file: all subsequent nodes are moved several lines further. Since the boundaries in SQL table represent the line numbers, we should do the same: calculate the offset and make a batch update to all nodes to the right of the updated or inserted one. Very hard to implement and very inefficient.

Now good news.

Three of four major systems (that is SQL Server, Oracle and PostgreSQL 8.4) now support recursion natively.

The fourth one (MySQL) does not support it, but it can be emulated to the extent required to run queries against the hierarchical data modelled according to adjacency list model:

In this series of articles, we will compare efficiency of the adjacency list model to that of the nested sets model.

PostgreSQL 8.4 is the system we begin with.

Analysis

PostgreSQL 8.4 supports recursive queries by means of such called hierarchical CTE's.

A hierarchical CTE is an analog of this query:

SELECT  *
FROM    t_hierarchy h1
WHERE   … 

UNION ALL

SELECT  *
FROM    t_hierarchy h1
JOIN    t_hierarchy h2
ON    …
WHERE   … 

UNION ALL

SELECT  *
FROM    t_hierarchy h1
JOIN    t_hierarchy h2
ON    …
JOIN    t_hierarchy h3
ON    …
WHERE   …

…

with the theoretically unlimited number of UNION ALL's built at runtime and the results of each query cached and called recursively.

To define such a construct, one uses WITH RECURSIVE clause.

To compare both methods, we will create a sample table which combines both data models. Each node will have both parent and the boundaries (lft and rgt) defined. Then we will run the three most important queries, which, again, are:

  • Find all descendants of a given node
  • Find all ancestors of a given node
  • Find all descendants of a given node up to a certain depth

Here's the script to create a sample table:

Table creation details

The table contains 8 levels of hierarchy with each node having 5 immediate children. This makes the table 2,441,405 records long.

Each record has a 100-byte long field stuffing which emulates the payload in actual tables.

The fields parent, lft and rgt are indexed.

All descendants

There are lots of descendants, that's why we will select and aggregate the lengths of their stuffing fields. Since that field is not indexed, it will emulate selection of all values from an actual table rather well.

Nested sets

SELECT  SUM(LENGTH(hc.stuffing))
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = 42

View query details

Nested sets is particularly good for this kind of query, since it requires a single range scan on the index on lft.

This query runs for 50 ms.

Adjacency list

WITH    RECURSIVE
        q AS
        (
        SELECT  id, stuffing
        FROM    t_hierarchy h
        WHERE   id = 42
        UNION ALL
        SELECT  hc.id, hc.stuffing
        FROM    q
        JOIN    t_hierarchy hc
        ON      hc.parent = q.id
        )
SELECT  SUM(LENGTH(stuffing))
FROM    q

View query details

This query is a trifle less efficient since it requires several index scans instead of a single one. However, the resulting range is of course the same (because the values returned are the same).

This query completes in 98 ms, or less than twice as long as the nested sets one.

All ancestors

Nested sets

SELECT  hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM    t_hierarchy hc
JOIN    t_hierarchy hp
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hc.id = 1000000
ORDER BY
        hp.lft

View query details

This query returns much fewer rows than the previous one (only 9 rows instead of almost 200,000), but due to its nature it is much more slow and takes almost 2 seconds.

This is because we search the other way round in this case: instead of looking for indexed value within the range of constants, we need to search the constant against the list of ranges.

Ranges cannot be efficiently indexed using B-Tree indexes, that's why PostgreSQL uses only part of the condition (hc.lft <= hp.rgt), builds a bitmap on it, scans the table using this bitmap and filters the values using the second part of the condition (hc.lft <= hp.rgt).

This is quite a costly operations since it requires an index scan (which PostgreSQL is not very good at) which returns almost half of all rows.

Adjacency list

WITH    RECURSIVE
        q AS
        (
        SELECT  h.*, 1 AS level
        FROM    t_hierarchy h
        WHERE   id = 1000000
        UNION ALL
        SELECT  hp.*, level + 1
        FROM    q
        JOIN    t_hierarchy hp
        ON      hp.id = q.parent
        )
SELECT  id, parent, lft, rgt, data
FROM    q
ORDER BY
        level DESC

View query details

Now this query is literally instant: only 4 ms which is within the time measurement error range.

Recursion does very good job here: since the hierarchy is limited, traversing the tree upwards takes only 9 index lookups on the PRIMARY KEY and then sorting of 9 values. Both operations are very simple and complete in no time.

Descendants up to a given level

Nested sets

We will run two queries: one with a node close to the root, the second one with a node far from the root.

SELECT  hc.id, hc.parent, hc.lft, hc.rgt, hc.data
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = ?
        AND
        (
        SELECT  COUNT(*)
        FROM    t_hierarchy hn
        WHERE   hc.lft BETWEEN hn.lft AND hn.rgt
                AND hn.lft BETWEEN hp.lft AND hp.rgt
        ) <= 3

View query details for node 42

View query details for node 31,415

We see that the second query is reasonably fast (completes in 50 ms).

However, the first query (which is in fact more often used) takes 120.6 seconds, or more than 2 minutes!

This is because the query should count all ancestors for all descendants that are within the given node.

It's fast for the nodes that are further from the root (since they don't have lots of descendants), but it may become a real problem when trying to obtain, say, children and grandchildren of a root node.

And this is the task most online catalogs begin their work with: they need to show first-level categories and subcategories. 2 minutes is way too much for this.

Adjacency list

WITH    RECURSIVE
        q AS
        (
        SELECT  id, parent, lft, rgt, data, ARRAY[id] AS level
        FROM    t_hierarchy hc
        WHERE   id = ?
        UNION ALL
        SELECT  hc.id, hc.parent, hc.lft, hc.rgt, hc.data, q.level || hc.id
        FROM    q
        JOIN    t_hierarchy hc
        ON      hc.parent = q.id
        WHERE   array_upper(level, 1) < 3
        )
SELECT  id, parent, lft, rgt, data
FROM    q
ORDER BY
        level

Note the ORDER BY and level constructs. They are intended to preserve the tree-like ordering. Arrays are ordered lexicographically in PostgreSQL and each level contains the breadcrumbs from the root node to the current node.

Here are the query results:

View query details for node 42

View query details for node 31,415

As we can see, both queries complete in a little more than 5 ms (instantly) and this time does not depend on the proximity to the root node.

Summary

We have compared the three most common queries that are usually issued against the hierarachical data:

  1. Find all descendants of a given node
  2. Find all ancestors of a given node
  3. Find all descendants of a given node up to a certain depth

The nested set model the fastest for the first query (0.05 s), however, adjacency list shows good performance and is very fast too (selecting 200,000 rows is a matter of less than 0.1 second).

For the second query, adjacency list is much faster, however, the nested sets are still usable.

Finally, for the third query, nested sets model shows dependency on the node.

For a node that has few descendants, the query is rather fast, however, for a node close to the root (to say nothing of the root itself) this query is intolerably slow.

Adjacency list shows superb performance on both nodes.

Conclusion

Given the said above and taking into account that the nested sets model is much harder to manage, we can conclude that adjacency list model should be used to manage hierarchical data in PostgreSQL 8.4.

Nested sets model was a very smart invention to manage hierarchical data in an environment that allowed no recursion. But now, when recursive queries are finally available, adjacency list model is just better.

It yields excellent performance on all three types of queries, outperforms nested sets in two of three most used queries and is extremely simple to manage.

To be continued.

Written by Quassnoi

September 24th, 2009 at 11:00 pm

Posted in PostgreSQL

4 Responses to 'Adjacency list vs. nested sets: PostgreSQL'

Subscribe to comments with RSS or TrackBack to 'Adjacency list vs. nested sets: PostgreSQL'.

  1. I found this post via a Google search on nested sets and adjacency lists. Your consideration of them in this article is excellent and very helpful. I’ll have to start reading your other posts now.

    Dan

    4 May 13 at 01:19

  2. Hi, for “Descendants up to a given level” with nested set: why don’t you cache the depth of each node in separate column? You will not need that wild subquery, right?

    Petko

    14 Jul 13 at 13:24

  3. @Petko: sure, but every denormalization requires additional effort to maintain it.

    If you were to add a parent to the head category, you would need to increase the depth of all descentants by one.

    What you say is a common practice BTW, if depth updates are not expected and/or their slowness can be tolerated. Another common practice is to add parent id to each node, thus combining adjacency list and nested sets.

    Quassnoi

    18 Jul 13 at 19:21

  4. Hi,
    Of course that the updates in the nested set model are a lot slower. And not only those not-really-common operation of adding a parent to the base of the tree – all additions and deletions require updating all nodes to the right of the affected node (as you mentioned actually)

    I notice now that the ancestors query can also be optimized a lot – just fetch the lft and rgt of the child with one select and then retrieve all parents like this

    SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data
    FROM t_hierarchy hp
    WHERE hp.lft :child_right
    ORDER BY hp.lft ASC;

    Both selects will run a lot faster than the one you used.

    Don’t get me wrong – benchmarks like this are really useful and thank you for the efforts! But you should take care to be as fair and objective as possible.

    It would be great if you run your tests with the optimizations I mentioned so we know whether the nested set is really slower than the Postgre recursion when reading.

    Petko

    20 Jul 13 at 21:06

Leave a Reply