PostgreSQL: using recursive functions in nested sets at EXPLAIN EXTENDED

PostgreSQL: using recursive functions in nested sets

In the previous article, I discussed a way to improve nested sets model in PostgreSQL.

The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.

This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)).

That solution was pure SQL and it was quite fast, but not optimal.

The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.

We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.

Let's create a sample table:

Table creation details

CREATE TABLE t_hierarchy (
        id INT NOT NULL,
        parent INT NOT NULL,
        lft INT NOT NULL,
        rgt INT NOT NULL,
        data VARCHAR(100) NOT NULL,
        stuffing VARCHAR(100) NOT NULL
);

INSERT
INTO    t_hierarchy
WITH RECURSIVE
        ini AS
        (
        SELECT  8 AS level, 5 AS children
        ),
        range AS
        (
        SELECT  level, children,
                (
                SELECT  SUM(POW(children, n)::INTEGER * ((n < level)::INTEGER + 1))
                FROM    generate_series(level, 0, -1) n
                ) width
        FROM    ini
        ),
        q AS
        (
        SELECT  s AS id, 0 AS parent, level, children,
                1 + width * (s - 1) AS lft,
                1 + width * s - 1 AS rgt,
                width / children AS width
        FROM    (
                SELECT  r.*, generate_series(1, children) s
                FROM    range r
                ) q2
        UNION ALL
        SELECT  id * children + position, id, level - 1, children,
                1 + lft + width * (position - 1),
                1 + lft + width * position - 1,
                width / children
        FROM    (
                SELECT  generate_series(1, children) AS position, q.*
                FROM    q
                ) q2
        WHERE   level > 0
        )
SELECT  id, parent, lft, rgt, 'Value ' || id, RPAD('', 100, '*')
FROM    q;

ALTER TABLE t_hierarchy ADD CONSTRAINT pk_hierarchy_id PRIMARY KEY (id);
CREATE UNIQUE INDEX ux_hierarchy_lft ON t_hierarchy (lft);
CREATE UNIQUE INDEX ux_hierarchy_rgt ON t_hierarchy (rgt);
CREATE INDEX ix_hierarchy_parent ON t_hierarchy (parent);
CREATE INDEX ix_hierarchy_sets ON t_hierarchy USING GIST(POLYGON(BOX(POINT(-1, lft), POINT(1, rgt))));

If we run the query introduced in the previous article to fetch all children up to level 2 from a really top node, we get the following results:

WITH    RECURSIVE
        q AS
        (
        SELECT  id, lft, rgt, 1 AS lvl
        FROM    t_hierarchy
        WHERE   id = 1
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  DISTINCT ON (MAX(hc.rgt) OVER (PARTITION BY q.id ORDER BY hc.lft)) hc.id, hc.lft, hc.rgt, lvl + 1
                FROM    q
                JOIN    t_hierarchy hc
                ON      hc.lft > q.lft
                        AND hc.lft < q.rgt
                WHERE   lvl <= 2
                ORDER BY
                        MAX(hc.rgt) OVER (PARTITION BY q.id ORDER BY hc.lft), hc.lft
                ) q2
        )
SELECT  *
FROM    q

View query details

id	lft	rgt	lvl
1	1	585937	1
6	2	117188	2
7	117189	234375	2
8	234376	351562	2
9	351563	468749	2
10	468750	585936	2
31	3	23439	3
32	23440	46876	3
33	46877	70313	3
34	70314	93750	3
35	93751	117187	3
36	117190	140626	3
37	140627	164063	3
38	164064	187500	3
39	187501	210937	3
40	210938	234374	3
41	234377	257813	3
42	257814	281250	3
43	281251	304687	3
44	304688	328124	3
45	328125	351561	3
46	351564	375000	3
47	375001	398437	3
48	398438	421874	3
49	421875	445311	3
50	445312	468748	3
51	468751	492187	3
52	492188	515624	3
53	515625	539061	3
54	539062	562498	3
55	562499	585935	3
31 rows fetched in 0.0169s (14.7499s)

CTE Scan on q  (cost=3923687.62..4086447.04 rows=8137971 width=16)
  CTE q
    ->  Recursive Union  (cost=0.00..3923687.62 rows=8137971 width=16)
          ->  Index Scan using pk_hierarchy_id on t_hierarchy  (cost=0.00..8.54 rows=1 width=12)
                Index Cond: (id = 1)
          ->  Subquery Scan q2  (cost=363885.01..376091.97 rows=813797 width=16)
                ->  Unique  (cost=363885.01..367954.00 rows=813797 width=20)
                      ->  Sort  (cost=363885.01..365919.50 rows=813797 width=20)
                            Sort Key: (max(hc.rgt) OVER (?)), hc.lft
                            ->  WindowAgg  (cost=265682.87..283993.30 rows=813797 width=20)
                                  ->  Sort  (cost=265682.87..267717.36 rows=813797 width=20)
                                        Sort Key: q.id, hc.lft
                                        ->  Nested Loop  (cost=5335.66..185791.16 rows=813797 width=20)
                                              ->  WorkTable Scan on q  (cost=0.00..0.22 rows=3 width=16)
                                                    Filter: (lvl <= 2)
                                              ->  Bitmap Heap Scan on t_hierarchy hc  (cost=5335.66..57861.32 rows=271266 width=12)
                                                    Recheck Cond: ((hc.lft > q.lft) AND (hc.lft < q.rgt))
                                                    ->  Bitmap Index Scan on ux_hierarchy_lft  (cost=0.00..5267.84 rows=271266 width=0)
                                                          Index Cond: ((hc.lft > q.lft) AND (hc.lft < q.rgt))

This runs for almost 15 seconds: too much.

This can be improved by exploiting these two properties of the nested sets model:

The first immediate child of a node is the node holding the first lft next to the node's lft
The next sibling of a node is the node holding the first lft next to the node's rgt

If we recursively traverse through the nodes, we can find the first child as well as all of its siblings. This is enough to build a hierarchy, and level filter can be implemented merely by limiting the recursion depth.

However, recursive CTE's only allow one recursion level. We cannot nest the WITH clause.

To work around that, we can use PostgreSQL's ability to run set-returning functions recursively. We will use the function-based recursion to iterate the parent-child axis, and the CTE-based recursion to iterate siblings axis.

We need to create a function that would take a node's id on input and return a set of its children on output, with the function recursively applied to each of the children. To find a set of children, we will implement a recursive CTE that finds the first child in the anchor part and the next sibling in the recursive part.

Here's the function:

CREATE OR REPLACE FUNCTION fn_get_children(id INT, level INT)
RETURNS SETOF INT[] AS
$$
        WITH    RECURSIVE q AS
                (
                SELECT  (hc).id, (hc).lft, (hc).rgt, prgt
                FROM    (
                        SELECT  (
                                SELECT  hc
                                FROM    t_hierarchy hc
                                WHERE   hc.lft > hp.lft
                                        AND hc.lft < hp.rgt
                                ORDER BY
                                        hc.lft
                                LIMIT 1
                                ) hc,
                                rgt AS prgt
                        FROM    t_hierarchy hp
                        WHERE   hp.id = $1
                        ) q2
                UNION ALL
                SELECT  (hc).id, (hc).lft, (hc).rgt, prgt
                FROM    (
                        SELECT  (
                                SELECT  hc
                                FROM    t_hierarchy hc
                                WHERE   hc.lft > q.rgt
                                        AND hc.lft < q.prgt
                                ORDER BY
                                        hc.lft
                                LIMIT 1
                                ) hc,
                                prgt
                        FROM    q
                        WHERE   q.lft IS NOT NULL
                        ) q2
                )
        SELECT  CASE which
                WHEN 1 THEN ARRAY[q.id, $2]
                ELSE fn_get_children(q.id, $2 - 1)
                END
        FROM    (
                VALUES (1), (2)
                ) vals(which)
        CROSS JOIN
                q
        WHERE   q.id IS NOT NULL
                AND $2 > 0
        ORDER BY
                id, which;
$$ LANGUAGE sql;

The function accepts a node's id and a level on input, and returns a set of arrays, each corresponding to one of the node's children and its level. The level returned by the function decreases and in fact represents not the level as such, but the number of levels left to reach the filter-set bottom. But since the initial level is user-set, it is easy to cast it to the actual level.

Let's run the function:

SELECT  c[1], 3 - c[2]
FROM    fn_get_children(1, 2) c;

c	?column?
6	1
34	2
35	2
33	2
31	2
32	2
7	1
38	2
40	2
39	2
36	2
37	2
8	1
41	2
42	2
43	2
44	2
45	2
9	1
47	2
50	2
46	2
48	2
49	2
10	1
53	2
55	2
51	2
52	2
54	2
30 rows fetched in 0.0023s (0.0658s)

Function Scan on fn_get_children c  (cost=0.00..262.50 rows=1000 width=32)

As we can see, the function returned all children and grandchildren of the node 1 along with their level, and did it in only 65 ms.

Written by Quassnoi

March 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

One Response to 'PostgreSQL: using recursive functions in nested sets'

Subscribe to comments with RSS

Thank you!! Is this still possible with PostgreSQL 10? it doesn’t support set-returning functions anymore.

kgapp

3 Nov 20 at 05:11

EXPLAIN EXTENDED