PostgreSQL: parametrizing a recursive CTE
Answering questions asked on the site.
Jan Suchal asks:
We've started playing with PostgreSQL and recursive queries. Looking at example that does basic graph traversal from http://www.postgresql.org/docs/9.0/static/queries-with.html.
We would like to have a view called
paths
defined like this:WITH RECURSIVE search_graph(id, path) AS ( SELECT id, ARRAY[id] FROM node UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT path FROM search_graphBy calling
SELECT * FROM paths WHERE path[1] = :node_idwe would get all paths from a certain node.
The problem here is with performance. When you want this to be quick you need to add a condition for the anchor part of the
UNION
like this:WITH RECURSIVE search_graph(id, path) AS ( SELECT id, ARRAY[id] FROM node WHERE id = :node_id UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT path FROM search_graphNow it's perfectly fast, but we cannot create a view because that would only contain paths from one specific node.
Any ideas?
An often overlooked feature of PostgreSQL is its ability to create set-returning functions and use them in SELECT
list.
The record will be cross-joined with the set returned by the function and the result of the join will be added to the resultset.
This is best demonstrated with generate_series
, probably a most used PostgreSQL set-returning function.
Emulating CROSS APPLY with set-returning functions
Let's write a simple query that returns numbers from 1 to 3:
SELECT id FROM ( VALUES (1), (2), (3) ) vals (id)
id |
---|
1 |
2 |
3 |
Now, let's add generate_series(1, id)
to the SELECT
list. As we can see, each of the three values is passed as an argument to generate_series
:
SELECT id, generate_series(1, id) FROM ( VALUES (1), (2), (3) ) vals (id)
id | generate_series |
---|---|
1 | 1 |
2 | 1 |
2 | 2 |
3 | 1 |
3 | 2 |
3 | 3 |
We see that each record of the original query was cross-joined with the set returned by the function, and the final resultset now has 6 records instead of 3, since 3 sets, having 1, 2 and 3 records, respectively, were results of these cross-joins.
This is almost what CROSS APPLY
does in SQL Server, but with some limitations and caveats.
Limitations of set-returning functions
- One should create a function explicitly. Anonymous blocks won't work.
- The functions should be written in SQL or C. Procedural language set-returning functions cannot be used in
SELECT
lists - If the function return an empty set, the result of the cross join will be an empty set too (i. e. the corresponding record won't be returned at all). This is exactly how
CROSS APPLY
behaves in SQL Server.However, the latter supports
OUTER APPLY
which always returns a single record with aNULL
in corresponding fields instead of the empty set, and it's impossible to emulate it in PostgreSQL using this syntax.A workaround would be to develop the function so that it would return a single
NULL
instead of an empty set. - The function should be marked
VOLATILE
so that the optimizer would execute it each time it's called. This is especially important when the function is used multiple times in the sameSELECT
list, like this:SELECT generate_series(1, 3), generate_series(1, 2)
generate_series generate_series 1 1 2 2 3 1 1 2 2 1 3 2 This query correctly returns 6 records.
SELECT generate_series(1, 3), generate_series(1, 3)
generate_series generate_series 1 1 2 2 3 3 This query incorrectly returns 3 records (it should return 9). This is because function is not reevaluated.
Parametrizing the path query
Now, we are informed enough to create our own function and test it. To do this, we will first create some sample tables:
There are 10,000 nodes with 9,996 random paths between them.
Here's what the function would look like:
CREATE OR REPLACE FUNCTION fn_search_graph_cte(parent BIGINT) RETURNS TABLE ( path BIGINT[] ) AS $$ WITH RECURSIVE search_graph(id, path) AS ( SELECT $1, ARRAY[$1] UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT path FROM search_graph; $$ LANGUAGE 'sql' VOLATILE;
The anchor part of the recursive CTE is parametrized.
Paths from a single node
Now, let's compare performance of the queries that select paths from a single node:
WITH RECURSIVE search_graph(id, path) AS ( SELECT id, ARRAY[id] FROM node UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT path::TEXT FROM search_graph WHERE path[1] = 69
path | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
{69} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3804} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,3768} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925,5683} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925,5683,8668} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925,5683,8668,3705} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0002s (0.5156s) |
CTE Scan on search_graph (cost=151096.75..187130.22 rows=7999 width=32) Filter: (path[1] = 69) CTE search_graph -> Recursive Union (cost=0.00..151096.75 rows=1599710 width=40) -> Seq Scan on node (cost=0.00..164.00 rows=10000 width=8) -> Hash Join (cost=288.91..11893.85 rows=158971 width=40) Hash Cond: (sg.id = g.source) Join Filter: (g.dest <> ALL (sg.path)) -> WorkTable Scan on search_graph sg (cost=0.00..2000.00 rows=100000 width=40) -> Hash (cost=163.96..163.96 rows=9996 width=16) -> Seq Scan on graph g (cost=0.00..163.96 rows=9996 width=16)
This query materializes the whole CTE and then filters it for the paths beginning from 69. It takes more than 500 ms.
SELECT fn_search_graph_cte(id)::TEXT FROM node WHERE id = 69
fn_search_graph_cte | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
{69} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3804} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,3768} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925,5683} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925,5683,8668} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
{69,3642,2925,5683,8668,3705} | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0003s (0.0030s) |
Index Scan using node_pkey on node (cost=0.00..8.52 rows=1 width=8) Index Cond: (id = 69)
This uses the function. Since the anchor part of the CTE has only one record, this is much faster and completes in 3 ms.
All paths
Let's check how long does it take to return all paths.
First, let's use the CTE:
WITH RECURSIVE search_graph(id, path) AS ( SELECT id, ARRAY[id] FROM node UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT COUNT(*), AVG(ARRAY_LENGTH(path, 1)) FROM search_graph
count | avg | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
159522 | 10.3358157495517860 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.5469s) |
Aggregate (cost=191089.51..191089.52 rows=1 width=32) CTE search_graph -> Recursive Union (cost=0.00..151096.75 rows=1599710 width=40) -> Seq Scan on node (cost=0.00..164.00 rows=10000 width=8) -> Hash Join (cost=288.91..11893.85 rows=158971 width=40) Hash Cond: (sg.id = g.source) Join Filter: (g.dest <> ALL (sg.path)) -> WorkTable Scan on search_graph sg (cost=0.00..2000.00 rows=100000 width=40) -> Hash (cost=163.96..163.96 rows=9996 width=16) -> Seq Scan on graph g (cost=0.00..163.96 rows=9996 width=16) -> CTE Scan on search_graph (cost=0.00..31994.20 rows=1599710 width=32)
Now, the function:
SELECT COUNT(*), AVG(ARRAY_LENGTH(path, 1)) FROM ( SELECT fn_search_graph_cte(id) AS path FROM node ) q
count | avg | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
159522 | 10.3358157495517860 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (2.1093s) |
Aggregate (cost=2814.01..2814.02 rows=1 width=32) -> Seq Scan on node (cost=0.00..2664.00 rows=10000 width=8)
We see that the function introduces some overhead: it needs to be called 10,000 times and this takes 2 seconds (as opposed to 500 ms for the plain CTE). This means that the function is more efficient if the total number of the root nodes is less than 25% of all nodes.
Conclusion
A set-returning function is a good replacement for a view over a recursive CTE whose anchor part cannot be easily parametrized.
The sets returned by the function, when called in a SELECT
list, are cross-joined with the corresponding records. When the filter on the CTE is selective, the function will only be applied few times to the records satisfying the filter.
Call to a function, however, introduces some overhead which should be taken into account. In some cases, when a filter is not very selective, it is better to use the benefits of set-based operations to generate the view and then filter it than to call a function multiple times, adding the overhead of each call.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Hi,
Really interesting stuff, but one comment; the link to the “table creation details” seems broken, it links back to this page. Please could you post the table creation details so I can try this for myself.
Thanks!
–Robin
Robin
15 May 11 at 19:09
@Robin: try clicking the link in any modern browser. It will show the code right in the browser window, under the link.
Quassnoi
16 May 11 at 09:15