Recursive CTE's: PostgreSQL at EXPLAIN EXTENDED

Recursive CTE’s: PostgreSQL

Comments enabled. I *really* need your comment

In the previous article on recursive CTE's in SQL Server I demonstrated that they are not really set-based.

SQL Server implements the recursive CTE's syntax, but forbids all operations that do not distribute over UNION ALL, and each recursive step sees only a single record from the previous step.

Now, let's check the same operations in PostgreSQL 8.4.

To do this, we well write a query that selects only the very first branch of a tree: that is, each item would be the first child of its parent. To do this, we should select the item that would be the first child of the root, the select the first child of that item etc.

This is a set-based operation.

Oracle's CONNECT BY syntax, despite being set-based, offers some limited set-based capabilities: you can use ORDER SIBLINGS BY clause to define the order in which the siblings are returned. However, this would require some additional work to efficiently return only the first branch.

In a true set-based system, this is much more simple.

Let's create a sample table:

CREATE TABLE t_recursive (
id INT NOT NULL PRIMARY KEY,
parent INT NOT NULL,
orderer INT NOT NULL,
data VARCHAR(100) NOT NULL
);

CREATE INDEX ix_recursive_parent_orderer ON t_recursive (parent, orderer);

SELECT  SETSEED(0.20091123);

INSERT
INTO    t_recursive
SELECT  s, (s - 1) / 5, FLOOR(RANDOM() * 10000), 'Item ' || s
FROM    generate_series(1, 1000000) s;

This table contains 1,000,000 records and implements an adjacency tree hierarchy.

Each item has at most 5 children and a randomly filled column, orderer, which defines its order along its siblings.

Now, let's try to make a query that would select the first branch, the order being defined by the value provided in orderer.

To do this, we should make the anchor step to return the first child of the root item, and the recursive steps to return the first child of the previously returned item.

To return the first child of a given parent in the orderer order, we can use PostgreSQL's DISTINCT ON functionality.

This is the same as DISTINCT, but can return the whole row rather than a column DISTINCT is being applied to.

If two rows share the value of the column DISTINCT ON is being applied to, only one of the rows will be returned. Which row will it be is defined by the ORDER BY clause.

This solves the problems like return a single row that holds group-wise maximum.

This query:

SELECT  DISTINCT ON (parent) *
FROM    t_recursive
ORDER BY
parent, orderer

is the same as this one:

SELECT  (q.r).*
FROM    (
SELECT  r, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY orderer) AS rn
FROM    t_recursive r
) q
WHERE   rn = 1

, but more legible and in some cases more efficient.

To apply this clause to the task we need, we should just use DISTINCT ON (parent) in both anchor and recursive parts:

WITH    RECURSIVE
rows AS
(
SELECT  *
FROM    (
SELECT  DISTINCT ON (parent) *
FROM    t_recursive
WHERE   parent = 0
ORDER BY
parent, orderer
) q
UNION ALL
SELECT  *
FROM    (
SELECT  DISTINCT ON (c.parent) c.*
FROM    rows r
JOIN    t_recursive c
ON      c.parent = r.id
ORDER BY
c.parent, c.orderer
) q2
)
SELECT  *
FROM    rows

id	parent	orderer	data
3	0	1686	Item 3
19	3	3370	Item 19
98	19	42	Item 98
492	98	1762	Item 492
2464	492	2295	Item 2464
12322	2464	2050	Item 12322
61614	12322	768	Item 61614
308074	61614	1925	Item 308074
8 rows fetched in 0.0004s (0.0042s)

CTE Scan on rows  (cost=3188.19..3207.83 rows=982 width=230)
  CTE rows
    ->  Recursive Union  (cost=0.00..3188.19 rows=982 width=230)
          ->  Unique  (cost=0.00..15.46 rows=2 width=23)
                ->  Index Scan using ix_recursive_parent_orderer on t_recursive  (cost=0.00..15.45 rows=5 width=23)
                      Index Cond: (parent = 0)
          ->  Unique  (cost=313.84..314.33 rows=98 width=23)
                ->  Sort  (cost=313.84..314.08 rows=98 width=23)
                      Sort Key: c.parent, c.orderer
                      ->  Nested Loop  (cost=0.00..310.60 rows=98 width=23)
                            ->  WorkTable Scan on rows r  (cost=0.00..0.40 rows=20 width=4)
                            ->  Index Scan using ix_recursive_parent_orderer on t_recursive c  (cost=0.00..15.45 rows=5 width=23)
                                  Index Cond: (c.parent = r.id)

This gives us a single branch containing just the records we need: each one being the first child of its parent.

Now, what if we wanted to return a list of records, each being the second child to its parent?

Since each recursive part takes only one record as an input (and returns one record as an output), we can just replace DISTINCT ON (which returns the first child of each group) with a OFFSET 1 LIMIT 1:

WITH    RECURSIVE
rows AS
(
SELECT  *
FROM    (
SELECT  *
FROM    t_recursive
WHERE   parent = 0
ORDER BY
orderer
OFFSET 1 LIMIT 1
) q
UNION ALL
SELECT  *
FROM    (
SELECT  c.*
FROM    rows r
JOIN    t_recursive c
ON      c.parent = r.id
ORDER BY
c.orderer
OFFSET 1 LIMIT 1
) q2
)
SELECT  *
FROM    rows

id	parent	orderer	data
1	0	2540	Item 1
6	1	3405	Item 6
33	6	2884	Item 33
166	33	3084	Item 166
833	166	1848	Item 833
4169	833	993	Item 4169
20850	4169	3126	Item 20850
104251	20850	3021	Item 104251
521256	104251	5492	Item 521256
9 rows fetched in 0.0005s (0.0043s)

CTE Scan on rows  (cost=1564.44..1564.66 rows=11 width=230)
  CTE rows
    ->  Recursive Union  (cost=3.09..1564.44 rows=11 width=230)
          ->  Limit  (cost=3.09..6.18 rows=1 width=23)
                ->  Index Scan using ix_recursive_parent_orderer on t_recursive  (cost=0.00..15.45 rows=5 width=23)
                      Index Cond: (parent = 0)
          ->  Limit  (cost=155.79..155.79 rows=1 width=23)
                ->  Sort  (cost=155.79..155.91 rows=49 width=23)
                      Sort Key: c.orderer
                      ->  Nested Loop  (cost=0.00..155.30 rows=49 width=23)
                            ->  WorkTable Scan on rows r  (cost=0.00..0.20 rows=10 width=4)
                            ->  Index Scan using ix_recursive_parent_orderer on t_recursive c  (cost=0.00..15.45 rows=5 width=23)
                                  Index Cond: (c.parent = r.id)

This query returns a whole branch of the items that are second children to their parents.

Both these queries are available in SQL Server 2005: despite the fact they do not distribute over UNION ALL, they can be rewritten using a ROW_NUMBER().

As was shown in the previous article, for some strange reason, SQL Server 2005 does not forbid this clause, but rather implements it in the incorrect way. This, however, allows writing the query we're after.

Now, let's make a query that would recursively return the first two children.

This requires a set-based solution, since first two children can come from different parents.

A grandchild that is third to its grandfather won't count, even if it was a first child to the first child of the grandfather.

A first child to the third child of the grandfather won't count either, even if it was the first grandchild to its grandfather.

To do this right, on each step the query should accept 2 items, return 2 items and process the items accepted at once.

This is also easy in a true set-based recursive CTE:

WITH    RECURSIVE
rows AS
(
SELECT  *
FROM    (
SELECT  *
FROM    t_recursive r
WHERE   parent = 0
ORDER BY
orderer
LIMIT 2
) q
UNION ALL
SELECT  *
FROM    (
SELECT  c.*
FROM    rows r
JOIN    t_recursive c
ON      c.parent = r.id
ORDER BY
c.orderer
LIMIT 2
) q
)
SELECT  *
FROM    rows r

id	parent	orderer	data
3	0	1686	Item 3
1	0	2540	Item 1
8	1	2181	Item 8
19	3	3370	Item 19
98	19	42	Item 98
99	19	1351	Item 99
497	99	1245	Item 497
496	99	1255	Item 496
2486	497	205	Item 2486
2484	496	362	Item 2484
12431	2486	29	Item 12431
12423	2484	311	Item 12423
62119	12423	1113	Item 62119
62120	12423	1121	Item 62120
310602	62120	341	Item 310602
310605	62120	468	Item 310605
16 rows fetched in 0.0007s (0.0043s)

CTE Scan on rows r  (cost=3122.65..3123.09 rows=22 width=230)
  CTE rows
    ->  Recursive Union  (cost=0.00..3122.65 rows=22 width=230)
          ->  Limit  (cost=0.00..6.18 rows=2 width=23)
                ->  Index Scan using ix_recursive_parent_orderer on t_recursive r  (cost=0.00..15.45 rows=5 width=23)
                      Index Cond: (parent = 0)
          ->  Limit  (cost=311.58..311.58 rows=2 width=23)
                ->  Sort  (cost=311.58..311.82 rows=98 width=23)
                      Sort Key: c.orderer
                      ->  Nested Loop  (cost=0.00..310.60 rows=98 width=23)
                            ->  WorkTable Scan on rows r  (cost=0.00..0.40 rows=20 width=4)
                            ->  Index Scan using ix_recursive_parent_orderer on t_recursive c  (cost=0.00..15.45 rows=5 width=23)
                                  Index Cond: (c.parent = r.id)

This query takes all children of the previous set and returns the first two of their descendants.

We see that PostgreSQL, unlike SQL Server, implements the recursive CTE's in truly set-based way.

The recursive part of the query can accept a set on input, return a set on output and do the set-based operations on the set received.

Written by Quassnoi

November 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

EXPLAIN EXTENDED