PostgreSQL: parametrizing a recursive CTE at EXPLAIN EXTENDED

PostgreSQL: parametrizing a recursive CTE

Answering questions asked on the site.

Jan Suchal asks:

We've started playing with PostgreSQL and recursive queries. Looking at example that does basic graph traversal from http://www.postgresql.org/docs/9.0/static/queries-with.html.

We would like to have a view called paths defined like this:
WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph
By calling
SELECT  *
FROM    paths
WHERE   path[1] = :node_id
we would get all paths from a certain node.

The problem here is with performance. When you want this to be quick you need to add a condition for the anchor part of the UNION like this:
WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
WHERE   id = :node_id
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph
Now it's perfectly fast, but we cannot create a view because that would only contain paths from one specific node.

Any ideas?

An often overlooked feature of PostgreSQL is its ability to create set-returning functions and use them in SELECT list.

The record will be cross-joined with the set returned by the function and the result of the join will be added to the resultset.

This is best demonstrated with generate_series, probably a most used PostgreSQL set-returning function.

Emulating CROSS APPLY with set-returning functions

Let's write a simple query that returns numbers from 1 to 3:

SELECT  id
FROM    (
VALUES
(1),
(2),
(3)
) vals (id)

id
1
2
3

Now, let's add generate_series(1, id) to the SELECT list. As we can see, each of the three values is passed as an argument to generate_series:

SELECT  id, generate_series(1, id)
FROM    (
VALUES
(1),
(2),
(3)
) vals (id)

id	generate_series
1	1
2	1
2	2
3	1
3	2
3	3

We see that each record of the original query was cross-joined with the set returned by the function, and the final resultset now has 6 records instead of 3, since 3 sets, having 1, 2 and 3 records, respectively, were results of these cross-joins.

This is almost what CROSS APPLY does in SQL Server, but with some limitations and caveats.

Limitations of set-returning functions

One should create a function explicitly. Anonymous blocks won't work.
The functions should be written in SQL or C. Procedural language set-returning functions cannot be used in SELECT lists
If the function return an empty set, the result of the cross join will be an empty set too (i. e. the corresponding record won't be returned at all). This is exactly how CROSS APPLY behaves in SQL Server.
However, the latter supports OUTER APPLY which always returns a single record with a NULL in corresponding fields instead of the empty set, and it's impossible to emulate it in PostgreSQL using this syntax.

A workaround would be to develop the function so that it would return a single NULL instead of an empty set.
The function should be marked VOLATILE so that the optimizer would execute it each time it's called. This is especially important when the function is used multiple times in the same SELECT list, like this:
```
SELECT  generate_series(1, 3), generate_series(1, 2)
```
generate_series generate_series

1 1

2 2

3 1

1 2

2 1

3 2

This query correctly returns 6 records.
```
SELECT  generate_series(1, 3), generate_series(1, 3)
```
generate_series generate_series

1 1

2 2

3 3

This query incorrectly returns 3 records (it should return 9). This is because function is not reevaluated.

Parametrizing the path query

Now, we are informed enough to create our own function and test it. To do this, we will first create some sample tables:

Table creation details

CREATE TABLE node
(
id BIGINT NOT NULL PRIMARY KEY,
name VARCHAR(100) NOT NULL
);

CREATE TABLE graph
(
source BIGINT NOT NULL,
dest BIGINT NOT NULL,
data FLOAT NOT NULL,
PRIMARY KEY (source, dest)
);

CREATE INDEX
ix_graph_dest
ON      graph (dest);

SELECT  SETSEED(0.20101224);

INSERT
INTO    node
SELECT  s, 'Node ' || s
FROM    generate_series(1, 10000) s;

INSERT
INTO    graph (source, dest, data)
SELECT  source, dest, RANDOM()
FROM    (
SELECT  DISTINCT
CEIL(RANDOM() * 10000) AS source,
CEIL(RANDOM() * 10000) AS dest
FROM    generate_series(1, 10000)
) q
WHERE   source &lt;&gt; dest

There are 10,000 nodes with 9,996 random paths between them.

Here's what the function would look like:

CREATE OR REPLACE FUNCTION fn_search_graph_cte(parent BIGINT)
RETURNS TABLE
(
path BIGINT[]
)
AS
$$
WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  $1, ARRAY[$1]
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph;
$$
LANGUAGE 'sql'
VOLATILE;

The anchor part of the recursive CTE is parametrized.

Paths from a single node

Now, let's compare performance of the queries that select paths from a single node:

WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path::TEXT
FROM    search_graph
WHERE   path[1] = 69

path
{69}
{69,3804}
{69,3642}
{69,3642,3768}
{69,3642,2925}
{69,3642,2925,5683}
{69,3642,2925,5683,8668}
{69,3642,2925,5683,8668,3705}
8 rows fetched in 0.0002s (0.5156s)

CTE Scan on search_graph  (cost=151096.75..187130.22 rows=7999 width=32)
  Filter: (path[1] = 69)
  CTE search_graph
    ->  Recursive Union  (cost=0.00..151096.75 rows=1599710 width=40)
          ->  Seq Scan on node  (cost=0.00..164.00 rows=10000 width=8)
          ->  Hash Join  (cost=288.91..11893.85 rows=158971 width=40)
                Hash Cond: (sg.id = g.source)
                Join Filter: (g.dest <> ALL (sg.path))
                ->  WorkTable Scan on search_graph sg  (cost=0.00..2000.00 rows=100000 width=40)
                ->  Hash  (cost=163.96..163.96 rows=9996 width=16)
                      ->  Seq Scan on graph g  (cost=0.00..163.96 rows=9996 width=16)

This query materializes the whole CTE and then filters it for the paths beginning from 69. It takes more than 500 ms.

SELECT  fn_search_graph_cte(id)::TEXT
FROM    node
WHERE   id = 69

fn_search_graph_cte
{69}
{69,3642}
{69,3804}
{69,3642,2925}
{69,3642,3768}
{69,3642,2925,5683}
{69,3642,2925,5683,8668}
{69,3642,2925,5683,8668,3705}
8 rows fetched in 0.0003s (0.0030s)

Index Scan using node_pkey on node  (cost=0.00..8.52 rows=1 width=8)
  Index Cond: (id = 69)

This uses the function. Since the anchor part of the CTE has only one record, this is much faster and completes in 3 ms.

All paths

Let's check how long does it take to return all paths.

First, let's use the CTE:

WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  COUNT(*), AVG(ARRAY_LENGTH(path, 1))
FROM    search_graph

count	avg
159522	10.3358157495517860
1 row fetched in 0.0001s (0.5469s)

Aggregate  (cost=191089.51..191089.52 rows=1 width=32)
  CTE search_graph
    ->  Recursive Union  (cost=0.00..151096.75 rows=1599710 width=40)
          ->  Seq Scan on node  (cost=0.00..164.00 rows=10000 width=8)
          ->  Hash Join  (cost=288.91..11893.85 rows=158971 width=40)
                Hash Cond: (sg.id = g.source)
                Join Filter: (g.dest <> ALL (sg.path))
                ->  WorkTable Scan on search_graph sg  (cost=0.00..2000.00 rows=100000 width=40)
                ->  Hash  (cost=163.96..163.96 rows=9996 width=16)
                      ->  Seq Scan on graph g  (cost=0.00..163.96 rows=9996 width=16)
  ->  CTE Scan on search_graph  (cost=0.00..31994.20 rows=1599710 width=32)

Now, the function:

SELECT  COUNT(*), AVG(ARRAY_LENGTH(path, 1))
FROM    (
SELECT  fn_search_graph_cte(id) AS path
FROM    node
) q

count	avg
159522	10.3358157495517860
1 row fetched in 0.0001s (2.1093s)

Aggregate  (cost=2814.01..2814.02 rows=1 width=32)
  ->  Seq Scan on node  (cost=0.00..2664.00 rows=10000 width=8)

We see that the function introduces some overhead: it needs to be called 10,000 times and this takes 2 seconds (as opposed to 500 ms for the plain CTE). This means that the function is more efficient if the total number of the root nodes is less than 25% of all nodes.

Conclusion

A set-returning function is a good replacement for a view over a recursive CTE whose anchor part cannot be easily parametrized.

The sets returned by the function, when called in a SELECT list, are cross-joined with the corresponding records. When the filter on the CTE is selective, the function will only be applied few times to the records satisfying the filter.

Call to a function, however, introduces some overhead which should be taken into account. In some cases, when a filter is not very selective, it is better to use the benefits of set-based operations to generate the view and then filter it than to call a function multiple times, adding the overhead of each call.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

December 24th, 2010 at 11:00 pm

Posted in PostgreSQL

2 Responses to 'PostgreSQL: parametrizing a recursive CTE'

Subscribe to comments with RSS

Hi,

Really interesting stuff, but one comment; the link to the “table creation details” seems broken, it links back to this page. Please could you post the table creation details so I can try this for myself.

Thanks!

–Robin

Robin

15 May 11 at 19:09
@Robin: try clicking the link in any modern browser. It will show the code right in the browser window, under the link.

Quassnoi

16 May 11 at 09:15

EXPLAIN EXTENDED