EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL: using recursive functions in nested sets

Comments enabled. I *really* need your comment

In the previous article, I discussed a way to improve nested sets model in PostgreSQL.

The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.

This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)).

That solution was pure SQL and it was quite fast, but not optimal.

The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.

We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.

Let's create a sample table:

Table creation details

If we run the query introduced in the previous article to fetch all children up to level 2 from a really top node, we get the following results:

WITH    RECURSIVE
        q AS
        (
        SELECT  id, lft, rgt, 1 AS lvl
        FROM    t_hierarchy
        WHERE   id = 1
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  DISTINCT ON (MAX(hc.rgt) OVER (PARTITION BY q.id ORDER BY hc.lft)) hc.id, hc.lft, hc.rgt, lvl + 1
                FROM    q
                JOIN    t_hierarchy hc
                ON      hc.lft > q.lft
                        AND hc.lft < q.rgt
                WHERE   lvl <= 2
                ORDER BY
                        MAX(hc.rgt) OVER (PARTITION BY q.id ORDER BY hc.lft), hc.lft
                ) q2
        )
SELECT  *
FROM    q

View query details

This runs for almost 15 seconds: too much.

This can be improved by exploiting these two properties of the nested sets model:

  1. The first immediate child of a node is the node holding the first lft next to the node's lft

  2. The next sibling of a node is the node holding the first lft next to the node's rgt

If we recursively traverse through the nodes, we can find the first child as well as all of its siblings. This is enough to build a hierarchy, and level filter can be implemented merely by limiting the recursion depth.

However, recursive CTE's only allow one recursion level. We cannot nest the WITH clause.

To work around that, we can use PostgreSQL's ability to run set-returning functions recursively. We will use the function-based recursion to iterate the parent-child axis, and the CTE-based recursion to iterate siblings axis.

We need to create a function that would take a node's id on input and return a set of its children on output, with the function recursively applied to each of the children. To find a set of children, we will implement a recursive CTE that finds the first child in the anchor part and the next sibling in the recursive part.

Here's the function:

CREATE OR REPLACE FUNCTION fn_get_children(id INT, level INT)
RETURNS SETOF INT[] AS
$$
        WITH    RECURSIVE q AS
                (
                SELECT  (hc).id, (hc).lft, (hc).rgt, prgt
                FROM    (
                        SELECT  (
                                SELECT  hc
                                FROM    t_hierarchy hc
                                WHERE   hc.lft > hp.lft
                                        AND hc.lft < hp.rgt
                                ORDER BY
                                        hc.lft
                                LIMIT 1
                                ) hc,
                                rgt AS prgt
                        FROM    t_hierarchy hp
                        WHERE   hp.id = $1
                        ) q2
                UNION ALL
                SELECT  (hc).id, (hc).lft, (hc).rgt, prgt
                FROM    (
                        SELECT  (
                                SELECT  hc
                                FROM    t_hierarchy hc
                                WHERE   hc.lft > q.rgt
                                        AND hc.lft < q.prgt
                                ORDER BY
                                        hc.lft
                                LIMIT 1
                                ) hc,
                                prgt
                        FROM    q
                        WHERE   q.lft IS NOT NULL
                        ) q2
                )
        SELECT  CASE which
                WHEN 1 THEN ARRAY[q.id, $2]
                ELSE fn_get_children(q.id, $2 - 1)
                END
        FROM    (
                VALUES (1), (2)
                ) vals(which)
        CROSS JOIN
                q
        WHERE   q.id IS NOT NULL
                AND $2 > 0
        ORDER BY
                id, which;
$$ LANGUAGE sql;

The function accepts a node's id and a level on input, and returns a set of arrays, each corresponding to one of the node's children and its level. The level returned by the function decreases and in fact represents not the level as such, but the number of levels left to reach the filter-set bottom. But since the initial level is user-set, it is easy to cast it to the actual level.

Let's run the function:

SELECT  c[1], 3 - c[2]
FROM    fn_get_children(1, 2) c;
c ?column?
6 1
34 2
35 2
33 2
31 2
32 2
7 1
38 2
40 2
39 2
36 2
37 2
8 1
41 2
42 2
43 2
44 2
45 2
9 1
47 2
50 2
46 2
48 2
49 2
10 1
53 2
55 2
51 2
52 2
54 2
30 rows fetched in 0.0023s (0.0658s)
Function Scan on fn_get_children c  (cost=0.00..262.50 rows=1000 width=32)

As we can see, the function returned all children and grandchildren of the node 1 along with their level, and did it in only 65 ms.

Written by Quassnoi

March 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

Leave a Reply