EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Things SQL needs: determining range cardinality. You're welcome to read and comment on it.

Adjacency list vs. nested sets: SQL Server

Continuing the series:

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

For detailed explanations of the terms, see the first article in the series:

Now, let’s see what’s better for SQL Server.

We will create a single table that holds both adjacency list data and nested sets data, with 8 levels of nesting, 5 children of each parent node and 2,441,405 records:

Table creation details

We will test the performance of 3 most common operations:

  • Find all descendants of a given node
  • Find all ancestors of a given node
  • Find all descendants of a given node up to a certain depth

All descendants

We will select all descendants of a given node (which is close to the root node and has lots of descendants).

Since there are more than 100,000 descendants, we will select the aggregated length of the stuffing columns rather than a huge resultset of values. All rows are required to be visited anyway, so this is quite good performance test.

Nested sets

SELECT  SUM(LEN(hc.stuffing))
FROM    [20090925_nested].t_hierarchy hp
JOIN    [20090925_nested].t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = 42

View query details

Nested sets model is good for this query, since it required only a single range scan on lft to get all descendants.

This query completes in 81 ms.

Adjacency list

WITH    q AS
        (
        SELECT  id, stuffing
        FROM    [20090925_nested].t_hierarchy h
        WHERE   id = 42
        UNION ALL
        SELECT  hc.id, hc.stuffing
        FROM    q
        JOIN    [20090925_nested].t_hierarchy hc
        ON      hc.parent = q.id
        )
SELECT  SUM(LEN(stuffing))
FROM    q

View query details

This query makes several range scans (one per node) using recursion.

These ranges finally sum up to the same set of values (of course), however, the parents of descendants are not contiguous and no single range can select them all. That's why this query is less efficient than nested sets and runs for 531 ms, or about 6 times as long.

All ancestors

Nested sets

SELECT  hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM    [20090925_nested].t_hierarchy hc
JOIN    [20090925_nested].t_hierarchy hp
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hc.id = 1000000
ORDER BY
        hp.lft

View query details

This is extremely inefficient: 16 seconds.

Unlike PostgreSQL, SQL Server is not capable of building in-memory bitmaps from the index results. It just has to seek the index and join. And the join condition used here is not an equality, thus making all join methods except NESTED LOOPS impossible to use.

All this results in a very poor performance which is far beyond that of even a marginally usable query.

Adjacency list

WITH    q AS
        (
        SELECT  h.*, 1 AS level
        FROM    [20090925_nested].t_hierarchy h
        WHERE   id = 1000000
        UNION ALL
        SELECT  hp.*, level + 1
        FROM    q
        JOIN    [20090925_nested].t_hierarchy hp
        ON      hp.id = q.parent
        )
SELECT  id, parent, lft, rgt, data
FROM    q
ORDER BY
        level DESC

View query details

Now, this query is a boon for adjacency list model. There are 9 ancestors that can be retrieved in 9 clustered index seeks and these 9 values then need to be sorted. SQL Server eats such queries for breakfast.

This query completes in 1 ms which is the range of time measurement error in SQL Server statistics.

In other words, this query is very very fast.

Descendants up to a given level

Since the performance on this task may depend on the number of the descendants, we will run 2 queries for each model.

This first query selects descendants of a node that is close to the root (node 42), the second one selects those of node 31,415 which is further from the root.

Nested sets

SELECT  hc.id, hc.parent, hc.lft, hc.rgt, hc.data
FROM    [20090925_nested].t_hierarchy hp
JOIN    [20090925_nested].t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = ?
        AND
        (
        SELECT  COUNT(*)
        FROM    [20090925_nested].t_hierarchy hn
        WHERE   hc.lft BETWEEN hn.lft AND hn.rgt
                AND hn.lft BETWEEN hp.lft AND hp.rgt
        ) <= 3

View results for node 42

View results for node 31,415

These queries complete in 108 seconds and 38 seconds, respectively. Both these queries use nested loops with a full table scan which doesn't add to performance. This makes the queries unusable.

Adjacency list

WITH    q AS
        (
        SELECT  id, parent, lft, rgt, data, 1 AS level,
                CAST(LEFT(CAST(id AS VARCHAR) + REPLICATE('0', 10), 10) AS VARCHAR) AS bc
        FROM    [20090925_nested].t_hierarchy hc
        WHERE   id = ?
        UNION ALL
        SELECT  hc.id, hc.parent, hc.lft, hc.rgt, hc.data, level + 1,
                CAST(bc + '.' + LEFT(CAST(hc.id AS VARCHAR) + REPLICATE('0', 10), 10) AS VARCHAR)
        FROM    q
        JOIN    [20090925_nested].t_hierarchy hc
        ON      hc.parent = q.id
        WHERE   level < 3
        )
SELECT  id, parent, lft, rgt, data
FROM    q
ORDER BY
        bc

The field bc defines breadcrumbs which can be used to return the nodes in tree order.

In PostgreSQL 8.4, we used an array for that.

SQL Server lacks array type, that's why we need to use a hack: converting the id's into zero-padded strings of fixed length (10 characters) and concatenating them.

Here are the query results:

View results for node 42

View results for node 31,415

Both queries use the same efficient plans (which basically include 31 index seeks and ordering of the values fetched) and complete in 2 ms, which may be considered instant.

Summary

We made an SQL Server table combining two popular models for hierarchical data: adjacency list and nested sets and tested three most used queries against these data:

  1. Find all descendants of a given node
  2. Find all ancestors of a given node
  3. Find all descendants of a given node up to a certain depth

As in the similar test for PostgreSQL 8.4 (conducted in the previous article), nested sets model has shown a little performance benefit for the first query (finding all descendants).

However, SQL Server, unlike PostgreSQL, is not able to build in-memory bitmaps which makes nested loops (required for the queries 2 and 3 against the nested sets model) very inefficient.

This makes the nested sets model very slow for queries 2 and 3 in SQL Server. More or less significant number of records in the table makes these queries too slow to be of practical use.

Adjacency list, on the contrary, shows decent performance for the query 1 and amazing performance for the queries 2 and 3.

This fact, along with the simplicity of adjacency list implementation and management, make adjacency list a preferred way to store hierarchical data in SQL Server versions that support recursive CTEs, that is versions from 2005 onwards.

Written by Quassnoi

September 25th, 2009 at 11:00 pm

Posted in SQL Server

Comments are closed.