EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL 8.4: preserving order for hierarchical query

with 3 comments

On Jul 1, 2009, PostgreSQL 8.4 was released.

In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.

Previously in the series:

Now, let's see how we can implement the hierarchical queries using the new features of PostgreSQL 8.4.

In PostgreSQL 8.3, we had to create a recursive function to do that. If you are still bound to 8.3 or an earlier version, you can read this article to see how to do it:

In 8.4, we have recursive CTE's (common table expressions).

Let's create a sample hierarchical table and see how can we query it:

CREATE TABLE t_hierarchy (
id INT NOT NULL PRIMARY KEY,
parent INT NOT NULL,
value VARCHAR(50) NOT NULL
);

CREATE INDEX ix_hierarchy_parent ON t_hierarchy (parent);

BEGIN TRANSACTION;
INSERT
INTO    t_hierarchy (id, parent, value)
SELECT  s, s / 5, 'Value ' || s
FROM    generate_series(1, 1000) s;

COMMIT;

ANALYZE t_hierarchy;

There are 1,000 rows in the table, each row being a child to a row with id 5 times as low as the current one (i. e. rows 500, 501, 502, 503 and 504 are all children of row 100)

In the previous versions, we had to create a type and a recursive function:

CREATE TYPE tp_hierarchy AS (node t_hierarchy, level INT);

CREATE OR REPLACE FUNCTION fn_hierarchy_connect_by(INT, INT)
RETURNS SETOF tp_hierarchy
AS
$$
SELECT  CASE
WHEN node = 1 THEN
(t_hierarchy, $2)::tp_hierarchy
ELSE
fn_hierarchy_connect_by((q.t_hierarchy).id, $2 + 1)
END
FROM    (
SELECT  t_hierarchy, node
FROM    (
SELECT  1 AS node
UNION ALL
SELECT  2
) nodes,
t_hierarchy
WHERE   parent = $1
ORDER BY
id, node
) q;
$$
LANGUAGE 'sql';

This function gave us the following results:

SELECT  REPEAT('  ', (q.h).level) || ((q.h).node).id,
((q.h).node).parent,
((q.h).node).value,
(q.h).level
FROM    (
SELECT  fn_hierarchy_connect_by(0, 1) AS h
) AS q

?column? parent value level
1 0 Value 1 1
5 1 Value 5 2
25 5 Value 25 3
125 25 Value 125 4
625 125 Value 625 5
626 125 Value 626 5
627 125 Value 627 5
628 125 Value 628 5
629 125 Value 629 5
126 25 Value 126 4
630 126 Value 630 5
631 126 Value 631 5
619 123 Value 619 4
124 24 Value 124 3
620 124 Value 620 4
621 124 Value 621 4
622 124 Value 622 4
623 124 Value 623 4
624 124 Value 624 4
1000 rows fetched in 0.0380s (0.1636s)
Subquery Scan q  (cost=0.00..0.28 rows=1 width=32)
  ->  Result  (cost=0.00..0.26 rows=1 width=0)

It runs quite fast, only 160 ms, and returns the rows in correct order.

In the current version, we don't need a function. Instead, we can use a recursive CTE:

WITH RECURSIVE q AS
(
SELECT  h, 1 AS level
FROM    t_hierarchy h
WHERE   parent = 0
UNION ALL
SELECT  hi, q.level + 1 AS level
FROM    q
JOIN    t_hierarchy hi
ON      hi.parent = (q.h).id
)
SELECT  REPEAT('  ', level) || (q.h).id,
(q.h).parent,
(q.h).value,
level
FROM    q

?column? parent value level
4 0 Value 4 1
3 0 Value 3 1
2 0 Value 2 1
1 0 Value 1 1
5 1 Value 5 2
6 1 Value 6 2
7 1 Value 7 2
8 1 Value 8 2
9 1 Value 9 2
10 2 Value 10 2
11 2 Value 11 2
998 199 Value 998 5
999 199 Value 999 5
1000 200 Value 1000 5
1000 rows fetched in 0.0386s (0.0166s)
CTE Scan on q  (cost=313.11..387.96 rows=2495 width=36)
  CTE q
    ->  Recursive Union  (cost=0.00..313.11 rows=2495 width=36)
          ->  Index Scan using ix_hierarchy_parent on t_hierarchy h  (cost=0.00..8.34 rows=5 width=32)
                Index Cond: (parent = 0)
          ->  Hash Join  (cost=1.63..25.49 rows=249 width=36)
                Hash Cond: (hi.parent = (q.h).id)
                ->  Seq Scan on t_hierarchy hi  (cost=0.00..17.00 rows=1000 width=36)
                ->  Hash  (cost=1.00..1.00 rows=50 width=36)
                      ->  WorkTable Scan on q  (cost=0.00..1.00 rows=50 width=36)

This query is very fast (only 40 ms).

However, it does not preserve correct hierarchical order.

This is because the resursive CTE does not return the rows in recursive order. Rather, it first select all rows for the current recursion level, and then executes the CTE for each of the rows, UNION'ing the result.

To select rows in correct order, we need some field we could sort on. Since we need a lexicographic sort here, no base type can be easily used to do such a sort.

Fortunately, PostgreSQL offers ARRAY datatype which is lexicographically comparable and hence is just what we need.

We will add an additional int[] field, called breadcrumb, to the results of the CTE, which will accumulate the parent ids in an ARRAY of integers.

Then we will just sort on this field.

Here's the query:

WITH RECURSIVE q AS
(
SELECT  h, 1 AS level, ARRAY[id] AS breadcrumb
FROM    t_hierarchy h
WHERE   parent = 0
UNION ALL
SELECT  hi, q.level + 1 AS level, breadcrumb || id
FROM    q
JOIN    t_hierarchy hi
ON      hi.parent = (q.h).id
)
SELECT  REPEAT('  ', level) || (q.h).id,
(q.h).parent,
(q.h).value,
level,
breadcrumb::VARCHAR AS path
FROM    q
ORDER BY
breadcrumb

?column? parent value level path
1 0 Value 1 1 {1}
5 1 Value 5 2 {1,5}
25 5 Value 25 3 {1,5,25}
125 25 Value 125 4 {1,5,25,125}
625 125 Value 625 5 {1,5,25,125,625}
626 125 Value 626 5 {1,5,25,125,626}
627 125 Value 627 5 {1,5,25,125,627}
628 125 Value 628 5 {1,5,25,125,628}
629 125 Value 629 5 {1,5,25,125,629}
126 25 Value 126 4 {1,5,25,126}
630 126 Value 630 5 {1,5,25,126,630}
631 126 Value 631 5 {1,5,25,126,631}
124 24 Value 124 3 {4,24,124}
620 124 Value 620 4 {4,24,124,620}
621 124 Value 621 4 {4,24,124,621}
622 124 Value 622 4 {4,24,124,622}
623 124 Value 623 4 {4,24,124,623}
624 124 Value 624 4 {4,24,124,624}
1000 rows fetched in 0.0648s (0.0270s)
Sort  (cost=547.44..553.68 rows=2495 width=68)
  Sort Key: q.breadcrumb
  CTE q
    ->  Recursive Union  (cost=0.00..319.34 rows=2495 width=72)
          ->  Index Scan using ix_hierarchy_parent on t_hierarchy h  (cost=0.00..8.34 rows=5 width=36)
                Index Cond: (parent = 0)
          ->  Hash Join  (cost=1.63..26.11 rows=249 width=72)
                Hash Cond: (hi.parent = (q.h).id)
                ->  Seq Scan on t_hierarchy hi  (cost=0.00..17.00 rows=1000 width=40)
                ->  Hash  (cost=1.00..1.00 rows=50 width=68)
                      ->  WorkTable Scan on q  (cost=0.00..1.00 rows=50 width=68)
  ->  CTE Scan on q  (cost=0.00..87.33 rows=2495 width=68)

This query is also almost instant and returns the hierarchical data from the CTE in correct order.

As a bonus, we get what Oracle call SYS_CONNECT_BY_PATH (a complete descendancy chain for each row) for free.

This new feature is very nice and a real boon for every PostgreSQL developer.

Written by Quassnoi

July 17th, 2009 at 11:00 pm

Posted in PostgreSQL

3 Responses to 'PostgreSQL 8.4: preserving order for hierarchical query'

Subscribe to comments with RSS

  1. How about case when id set for some level is {1,2,…,10}. Result of order by breadcrump will be 1,10,2…?

    Alessio

    5 Feb 13 at 20:56

  2. @Alessio: no.

    Quassnoi

    14 Mar 13 at 00:26

  3. This is exellent ! Thank you for putting this together. Very helpful.

    SkipDigit

    20 Dec 19 at 03:44

Leave a Reply