Adjacency list vs. nested sets: PostgreSQL
This series of articles is inspired by numerous questions asked on the site and on Stack Overflow.
What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?
First, let's explain what all this means.
Adjacency list
Hierarchical relations (not to be confused with hierarchical data model) are 0-1:0-N
transitive relations between entities of same domain.
For instance, ancestor-descendant relation is:
- Transitive:
- If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C
- Antisymmetric:
- If A is an ancestor of B, then B is never an ancestor of A
- Irreflexive:
- A is never an ancestor to itself
0-1:0-N
- A can have zero, one or many children. A can have zero or one parents.
These relations can be represented by an ordered directed tree.
Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.
Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.
This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that's why this representation is called adjacency list.
Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model
and adjacency list model
have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.
Now, since we have a tree here which implies 0-1:0-N
relationship between the vertices, we can define the relation as a self-relation: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a FOREIGN KEY
reference to the table itself.
Since multple items can have no parents (and therefore be the roots of their trees), it's sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.
This is a nice and elegant model, but until recently it had one drawback: it could not be used with SQL.
SQL, as we all know, deals with relational tables which can be transformed by the means of relational algebra.
It provides a way to do number of operations including relational multiplication, projection, sum etc.
However, earlier versions or SQL lacked recursion which is required to do certain operations efficiently. Namely, recursion is required to operate upon the adjacency list structure.
The most common operations are:
- Find all descendants of a given node
- Find all ancestors of a given node
- Find all descendants of a given node up to a certain depth
The first two operations require recursion.
The third one does too if the depth level should serve as a parameter.
To work around this, the nested sets model was proposed by Michael Kamfonas and popularized by Joe Celko.
Nested sets
The idea of nested sets is quite simple and can be illustrated using one of the most popular ways to store hierarchical data, the XML.
How would we store the hierarchies in XML?
We would just use one tag to describe every item and nest it accordingly, like this:
<item id="0"> <item id="1"> <item id="2"/> <item id="3"> <item id="4"/> </item> <item id="5"> <item id="6"> <item id="7"/> </item> </item> </item> <item id="8"/> </item>
This is fine, but how to use this structure in a relational table?
As you can see, opening and closing tags of each node are contained on their own lines here. The opening and closing tags may (or may not) be the same for the nodes containing no children (this is not important).
We see that the node ranges never intersect: given two node ranges, their intersection always makes the range of either of the nodes or an empty set.
In other words, a node open within another node should also close within it.
If and only if this holds for each and every node, the nodes make a valid XML file and a valid hierarchy.
Now, the nested set model can be described in one sentence:
Each set of nodes having a common ancestor is nested within the node of this ancestor. That's why this model is called nested sets
.
Historically, these line numbers are stored in columns named lft
and rgt
(since LEFT
and RIGHT
are reserved words in most SQL dialects).
That's how this hierarchy would look in a nested sets model:
ID | LFT | RGT |
---|---|---|
0 | 1 | 14 |
1 | 2 | 12 |
2 | 3 | 3 |
3 | 4 | 6 |
4 | 5 | 5 |
5 | 7 | 11 |
6 | 8 | 10 |
7 | 9 | 9 |
8 | 13 | 13 |
This model is more SQL friendly, since the tasks described above can be performed in SQL without using recursion.
To find out all ancestors of a given node, we just select all nodes that contain its LFT
boundary (which in a properly built hierarchy implies containing the RGT
boundary too):
SELECT hp.* FROM t_hierarchy hc JOIN t_hierarchy hp ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hc.id = ?
And to find the descendants, we shoud just reverse the condition, i. e. find all nodes that are contained between the current node's boundaries:
SELECT hc.* FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hp.id = ?
Ironically, selecting all descendants up to a given depth (which is the least problem for the adjacency list as long as the depth is known in design time) is the hardest task for the nested set model. Even obtaning the list of immediate parents and immediate children is not so simple.
However, this is solvable. This is the query to get all descendants up to the third generation (that is node itself, all children and grandchildren):
SELECT hc.* FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hp.id = ? AND ( SELECT COUNT(*) FROM t_hierarchy hn WHERE hc.lft BETWEEN hn.lft AND hn.rgt AND hn.lft BETWEEN hp.lft AND hp.rgt ) <= 3
Unlike adjacency list model, the depth level can be parametrized in this query which makes it possible to use a single query for all depth level.
Nested set model can be relatively easily queried for, but it's extremely hard to manage.
To insert a new child into a node or make an existing mode a child in adjacency list, everything we need is provide the new value of its parent column. With a single update we can move a whole branch.
To add a new node into a nested set model we should do exactly the same as if we were adding a new node into an XML file: all subsequent nodes are moved several lines further. Since the boundaries in SQL table represent the line numbers, we should do the same: calculate the offset and make a batch update to all nodes to the right of the updated or inserted one. Very hard to implement and very inefficient.
Now good news.
Three of four major systems (that is SQL Server, Oracle and PostgreSQL 8.4) now support recursion natively.
The fourth one (MySQL) does not support it, but it can be emulated to the extent required to run queries against the hierarchical data modelled according to adjacency list model:
In this series of articles, we will compare efficiency of the adjacency list model to that of the nested sets model.
PostgreSQL 8.4 is the system we begin with.
Analysis
PostgreSQL 8.4 supports recursive queries by means of such called hierarchical CTE's.
A hierarchical CTE is an analog of this query:
SELECT * FROM t_hierarchy h1 WHERE … UNION ALL SELECT * FROM t_hierarchy h1 JOIN t_hierarchy h2 ON … WHERE … UNION ALL SELECT * FROM t_hierarchy h1 JOIN t_hierarchy h2 ON … JOIN t_hierarchy h3 ON … WHERE … …
with the theoretically unlimited number of UNION ALL
's built at runtime and the results of each query cached and called recursively.
To define such a construct, one uses WITH RECURSIVE
clause.
To compare both methods, we will create a sample table which combines both data models. Each node will have both parent
and the boundaries (lft
and rgt
) defined. Then we will run the three most important queries, which, again, are:
- Find all descendants of a given node
- Find all ancestors of a given node
- Find all descendants of a given node up to a certain depth
Here's the script to create a sample table:
The table contains 8 levels of hierarchy with each node having 5 immediate children. This makes the table 2,441,405 records long.
Each record has a 100-byte long field stuffing
which emulates the payload in actual tables.
The fields parent
, lft
and rgt
are indexed.
All descendants
There are lots of descendants, that's why we will select and aggregate the lengths of their stuffing
fields. Since that field is not indexed, it will emulate selection of all values from an actual table rather well.
Nested sets
SELECT SUM(LENGTH(hc.stuffing)) FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hp.id = 42
Nested sets is particularly good for this kind of query, since it requires a single range scan on the index on lft
.
This query runs for 50 ms.
Adjacency list
WITH RECURSIVE q AS ( SELECT id, stuffing FROM t_hierarchy h WHERE id = 42 UNION ALL SELECT hc.id, hc.stuffing FROM q JOIN t_hierarchy hc ON hc.parent = q.id ) SELECT SUM(LENGTH(stuffing)) FROM q
This query is a trifle less efficient since it requires several index scans instead of a single one. However, the resulting range is of course the same (because the values returned are the same).
This query completes in 98 ms, or less than twice as long as the nested sets one.
All ancestors
Nested sets
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data FROM t_hierarchy hc JOIN t_hierarchy hp ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hc.id = 1000000 ORDER BY hp.lft
This query returns much fewer rows than the previous one (only 9 rows instead of almost 200,000), but due to its nature it is much more slow and takes almost 2 seconds.
This is because we search the other way round in this case: instead of looking for indexed value within the range of constants, we need to search the constant against the list of ranges.
Ranges cannot be efficiently indexed using B-Tree indexes, that's why PostgreSQL uses only part of the condition (hc.lft <= hp.rgt
), builds a bitmap on it, scans the table using this bitmap and filters the values using the second part of the condition (hc.lft <= hp.rgt
).
This is quite a costly operations since it requires an index scan (which PostgreSQL is not very good at) which returns almost half of all rows.
Adjacency list
WITH RECURSIVE q AS ( SELECT h.*, 1 AS level FROM t_hierarchy h WHERE id = 1000000 UNION ALL SELECT hp.*, level + 1 FROM q JOIN t_hierarchy hp ON hp.id = q.parent ) SELECT id, parent, lft, rgt, data FROM q ORDER BY level DESC
Now this query is literally instant: only 4 ms which is within the time measurement error range.
Recursion does very good job here: since the hierarchy is limited, traversing the tree upwards takes only 9 index lookups on the PRIMARY KEY
and then sorting of 9 values. Both operations are very simple and complete in no time.
Descendants up to a given level
Nested sets
We will run two queries: one with a node close to the root, the second one with a node far from the root.
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hp.id = ? AND ( SELECT COUNT(*) FROM t_hierarchy hn WHERE hc.lft BETWEEN hn.lft AND hn.rgt AND hn.lft BETWEEN hp.lft AND hp.rgt ) <= 3
View query details for node 42
View query details for node 31,415
We see that the second query is reasonably fast (completes in 50 ms).
However, the first query (which is in fact more often used) takes 120.6 seconds, or more than 2 minutes!
This is because the query should count all ancestors for all descendants that are within the given node.
It's fast for the nodes that are further from the root (since they don't have lots of descendants), but it may become a real problem when trying to obtain, say, children and grandchildren of a root node.
And this is the task most online catalogs begin their work with: they need to show first-level categories and subcategories. 2 minutes is way too much for this.
Adjacency list
WITH RECURSIVE q AS ( SELECT id, parent, lft, rgt, data, ARRAY[id] AS level FROM t_hierarchy hc WHERE id = ? UNION ALL SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data, q.level || hc.id FROM q JOIN t_hierarchy hc ON hc.parent = q.id WHERE array_upper(level, 1) < 3 ) SELECT id, parent, lft, rgt, data FROM q ORDER BY level
Note the ORDER BY
and level
constructs. They are intended to preserve the tree-like ordering. Arrays are ordered lexicographically in PostgreSQL and each level
contains the breadcrumbs
from the root node to the current node.
Here are the query results:
View query details for node 42
View query details for node 31,415
As we can see, both queries complete in a little more than 5 ms (instantly) and this time does not depend on the proximity to the root node.
Summary
We have compared the three most common queries that are usually issued against the hierarachical data:
- Find all descendants of a given node
- Find all ancestors of a given node
- Find all descendants of a given node up to a certain depth
The nested set model the fastest for the first query (0.05 s), however, adjacency list shows good performance and is very fast too (selecting 200,000 rows is a matter of less than 0.1 second).
For the second query, adjacency list is much faster, however, the nested sets are still usable.
Finally, for the third query, nested sets model shows dependency on the node.
For a node that has few descendants, the query is rather fast, however, for a node close to the root (to say nothing of the root itself) this query is intolerably slow.
Adjacency list shows superb performance on both nodes.
Conclusion
Given the said above and taking into account that the nested sets model is much harder to manage, we can conclude that adjacency list model should be used to manage hierarchical data in PostgreSQL 8.4.
Nested sets model was a very smart invention to manage hierarchical data in an environment that allowed no recursion. But now, when recursive queries are finally available, adjacency list model is just better.
It yields excellent performance on all three types of queries, outperforms nested sets in two of three most used queries and is extremely simple to manage.
To be continued.
I found this post via a Google search on nested sets and adjacency lists. Your consideration of them in this article is excellent and very helpful. I’ll have to start reading your other posts now.
Dan
4 May 13 at 01:19
Hi, for “Descendants up to a given level” with nested set: why don’t you cache the depth of each node in separate column? You will not need that wild subquery, right?
Petko
14 Jul 13 at 13:24
@Petko: sure, but every denormalization requires additional effort to maintain it.
If you were to add a parent to the head category, you would need to increase the depth of all descentants by one.
What you say is a common practice BTW, if depth updates are not expected and/or their slowness can be tolerated. Another common practice is to add parent id to each node, thus combining adjacency list and nested sets.
Quassnoi
18 Jul 13 at 19:21
Hi,
Of course that the updates in the nested set model are a lot slower. And not only those not-really-common operation of adding a parent to the base of the tree – all additions and deletions require updating all nodes to the right of the affected node (as you mentioned actually)
I notice now that the ancestors query can also be optimized a lot – just fetch the lft and rgt of the child with one select and then retrieve all parents like this
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM t_hierarchy hp
WHERE hp.lft :child_right
ORDER BY hp.lft ASC;
Both selects will run a lot faster than the one you used.
Don’t get me wrong – benchmarks like this are really useful and thank you for the efforts! But you should take care to be as fair and objective as possible.
It would be great if you run your tests with the optimizations I mentioned so we know whether the nested set is really slower than the Postgre recursion when reading.
Petko
20 Jul 13 at 21:06
very good point….when some DB now support Recursion, adjacency model could be a better choice
budy
22 Oct 15 at 10:30
very nice post.
I have a question..
My tree is categories(categoryid,title,parentid)
and i want to fetch tree with right order.
How can i do this?
The following query in postgresql fetches the tree
but is out of order…
WITH RECURSIVE
q AS
(
SELECT categoryid,title, 0 AS level
FROM categories h
WHERE parentid IS NULL
UNION ALL
SELECT hc.categoryid, hc.title, level +1
FROM q
JOIN categories hc
ON hc.parentid = q.categoryid
)
SELECT * FROM q
can you help me ? thanks.
cncakias
29 Dec 15 at 21:08
How about having two tables:
nodes: (node_id, level)
parent_child: (parent_id, child_id)
Where each node has exactly ‘level’ rows in parent_child.
Dzmitry
11 Jan 16 at 13:24
How does adjacent model compare with ltree’s approach in Postgresql?
Thomas
22 Nov 16 at 16:04
Hi,
very good article but 1 case has been omitted.
The nested set won’t works in case of genealogy tree with consanguinity. :/
So which method to use in this case (on MS SQL Server) ?
Vincent
Vincent DUVERNET
5 Oct 17 at 01:25
The formatting of some code blocks seems to be broken. Love the article tho. Very helpful.
Tim
17 Feb 19 at 22:37
@Tim: thanks for the heads up, still dealing with the fallout from the Gutenberg update
Quassnoi
17 Feb 19 at 23:04
Hi,
thanks for the article!
I believe that there is a small error in definitions sections. You claim that the relation is “irreflexive” which it is but the definition you give it to it is the definition of antisymmetry. So the relation is both antisymmetrical and irreflexive (A is never an ancestor of A).
Antti Järvinen
9 Mar 19 at 14:39
@Antti: you’re right, thanks for the heads up!
Quassnoi
10 Mar 19 at 14:50