Flattening timespans: PostgreSQL 8.4 at EXPLAIN EXTENDED

Flattening timespans: PostgreSQL 8.4

On Jul 1, 2009, PostgreSQL 8.4 was released.

Among other imporvements, this version supports window functions, recursive queries and common table expressions (CTE's).

Despite being a minor release (accoring to the version numeration), this version can become quite a milestone, since these features make developer's life much, much easier.

Let's check how efficienly these features are implemented.

To do this, I'll take some tasks that I wrote about in the previous blog posts and try to reimplement them using new PostgreSQL features.

I'll start with quite a common task of flattening the intersecting timespans which I wrote about in this article:

Flattening timespans: PostgreSQL

This task requires calculating a running maximum and taking a previous record from a recordset, and therefore is a good illustration for window functions.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop

2009.06.03 05:54:48:000 10:00:13:000

2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop

2009.06.03 10:12:50:000 10:27:25:000

2009.06.03 10:13:12:000 11:14:56:000

2009.06.03 10:27:25:000 10:27:31:000

2009.06.03 10:27:39:000 13:53:38:000

2009.06.03 11:14:56:000 11:15:03:000

2009.06.03 11:15:30:000 14:02:14:000

2009.06.03 13:53:38:000 13:53:43:000

2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

date	start	stop
2009.06.03	05:54:48:000	10:00:13:000
2009.06.03	09:26:45:000	09:59:40:000

date	start	stop
2009.06.03	10:12:50:000	10:27:25:000
2009.06.03	10:13:12:000	11:14:56:000
2009.06.03	10:27:25:000	10:27:31:000
2009.06.03	10:27:39:000	13:53:38:000
2009.06.03	11:14:56:000	11:15:03:000
2009.06.03	11:15:30:000	14:02:14:000
2009.06.03	13:53:38:000	13:53:43:000
2009.06.03	14:02:14:000	14:02:31:000

Let's create a sample table:

CREATE TABLE t_span
        (
        s_start TIMESTAMP NOT NULL,
        s_end TIMESTAMP NOT NULL
        );

CREATE INDEX ix_span_start ON t_span (s_start);

SELECT setseed(0.20090714);

INSERT
INTO    t_span
SELECT  CAST('14.07.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval,
        CAST('14.07.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval  + ((15 + RANDOM() * 300) || ' SECOND')::interval
FROM    generate_series (1, 1440000) s;

ANALYZE t_span;

To solve this task in PostgreSQL 8.4, I'll take a query that I previously wrote to do the same thing in Oracle:

Flattening timespans: Oracle

This query runs almost out of the box, with two slight changes: I just had to replace NVL with COALESCE and alias the subqueries.

Since both these things are required by ANSI standard, I think it's acceptable.

Here's the query:

WITH    q AS
        (
        SELECT  s_start,
                s_end,
                MAX(s_end) OVER (ORDER BY s_start) AS s_maxend,
                ROW_NUMBER() OVER (ORDER BY s_start) AS s_rn
        FROM    t_span s
        ORDER BY
                s_start
        )
SELECT  q1.s_start,
        q2.s_end
FROM    (
        SELECT  COALESCE(LAG(s_rn) OVER (ORDER BY s_rn), 1) AS s_prn, s_rn - 1 AS s_crn
        FROM    (
                SELECT  s_rn
                FROM    (
                        SELECT  q.*, LAG(s_maxend) OVER (ORDER BY s_rn) AS s_maxpend
                        FROM    q
                        ) qo
                WHERE   s_start > s_maxpend
                UNION ALL
                SELECT  MAX(s_rn) + 1
                FROM    q
                ) qo
        ) g
JOIN    q q1
ON      q1.s_rn = g.s_prn
JOIN    q q2
ON      q2.s_rn = g.s_crn
ORDER BY
        q1.s_start

s_start	s_end
2008-03-01 00:00:00	2008-03-02 04:56:49.659754
2008-03-02 04:57:00	2008-03-07 12:29:49.143806
2008-03-07 12:30:00	2008-03-08 20:56:28.941583
2008-03-08 20:56:30	2008-03-11 21:20:58.638719
2008-03-11 21:21:00	2008-03-20 13:51:15.281918
2008-03-20 13:51:30	2008-03-30 15:54:59.841784
2008-03-30 15:55:00	2008-04-02 13:53:45.933431
2008-04-02 13:54:00	2008-04-13 16:20:48.130234
2008-04-13 16:21:00	2008-04-22 23:24:19.521762

2009-07-06 19:11:00	2009-07-07 00:05:16.17393
2009-07-07 00:05:30	2009-07-11 10:38:45.594804
2009-07-11 10:39:00	2009-07-14 00:04:44.19723
95 rows fetched in 0.0030s (17.6874s)

Sort  (cost=11013843991026.27..11076052120626.27 rows=24883251840000 width=16)
  Sort Key: q1.s_start
  CTE q
    ->  WindowAgg  (cost=0.00..81387.52 rows=1440000 width=16)
          ->  Index Scan using ix_span_start on t_span s  (cost=0.00..56187.52 rows=1440000 width=16)
  ->  Merge Join  (cost=1013778667.64..374279850703.64 rows=24883251840000 width=16)
        Merge Cond: (g.s_prn = q1.s_rn)
        ->  Sort  (cost=1013553348.65..1022193366.65 rows=3456007200 width=16)
              Sort Key: g.s_prn
              ->  Merge Join  (cost=719705.08..52571813.09 rows=3456007200 width=16)
                    Merge Cond: (q2.s_rn = g.s_crn)
                    ->  Sort  (cost=225318.99..228918.99 rows=1440000 width=16)
                          Sort Key: q2.s_rn
                          ->  CTE Scan on q q2  (cost=0.00..28800.00 rows=1440000 width=16)
                    ->  Materialize  (cost=494386.09..500386.10 rows=480001 width=16)
                          ->  Sort  (cost=494386.09..495586.09 rows=480001 width=16)
                                Sort Key: g.s_crn
                                ->  Subquery Scan g  (cost=418283.54..432683.57 rows=480001 width=16)
                                      ->  WindowAgg  (cost=418283.54..427883.56 rows=480001 width=8)
                                            ->  Sort  (cost=418283.54..419483.54 rows=480001 width=8)
                                                  Sort Key: qo.s_rn
                                                  ->  Result  (cost=279456.99..359857.01 rows=480001 width=8)
                                                        ->  Append  (cost=279456.99..359857.01 rows=480001 width=8)
                                                              ->  Subquery Scan qo  (cost=279456.99..322656.99 rows=480000 width=8)
                                                                    Filter: (qo.s_start > qo.s_maxpend)
                                                                    ->  WindowAgg  (cost=279456.99..304656.99 rows=1440000 width=32)
                                                                          ->  Sort  (cost=279456.99..283056.99 rows=1440000 width=32)
                                                                                Sort Key: q.s_rn
                                                                                ->  CTE Scan on q  (cost=0.00..28800.00 rows=1440000 width=32)
                                                              ->  Aggregate  (cost=32400.00..32400.01 rows=1 width=8)
                                                                    ->  CTE Scan on q  (cost=0.00..28800.00 rows=1440000 width=8)
        ->  Materialize  (cost=225318.99..243318.99 rows=1440000 width=16)
              ->  Sort  (cost=225318.99..228918.99 rows=1440000 width=16)
                    Sort Key: q1.s_rn
                    ->  CTE Scan on q q1  (cost=0.00..28800.00 rows=1440000 width=16)

The query from above works (which is nice), works quite fast (which is also nice) but still performs more poorly than a pl/pgSQL solution using cursors. The pl/pgSQL solution takes but 3.6 seconds:

CREATE OR REPLACE FUNCTION fn_pgspans (f_from TIMESTAMP, f_till TIMESTAMP)
RETURNS SETOF t_span
AS
$$
DECLARE
        r_start TIMESTAMP;
        r_end TIMESTAMP;
        c_start TIMESTAMP;
        c_end TIMESTAMP;
        r t_span%ROWTYPE;
BEGIN
        FOR     c_start, c_end
        IN
        SELECT  s_start, s_end
        FROM    t_span
        WHERE   s_start BETWEEN f_from AND f_till
        ORDER BY
                s_start
        LOOP
                IF c_start > r_end OR r_start IS NULL THEN
                        IF c_start > r_end THEN
                                r := (r_start, r_end);
                                RETURN NEXT r;
                        END IF;
                        r_start := c_start;
                END IF;
                r_end := GREATEST(c_end, r_end);
        END LOOP;
        IF r_end IS NOT NULL THEN
                r := (r_start, r_end);
                RETURN NEXT r;
        END IF;
END;
$$
LANGUAGE 'plpgsql';

SELECT  *
FROM    fn_pgspans('01.01.2007', '01.01.2011');

s_start	s_end
2008-03-01 00:00:00	2008-03-02 04:56:56.228301
2008-03-02 04:57:00	2008-03-07 12:29:49.143806
2008-03-07 12:30:00	2008-03-08 20:56:28.941583
2008-03-08 20:56:30	2008-03-11 21:20:58.638719
2008-03-11 21:21:00	2008-03-20 13:51:19.202829
2008-03-20 13:51:30	2008-03-30 15:54:59.841784
2008-03-30 15:55:00	2008-04-02 13:53:53.391648
2008-04-02 13:54:00	2008-04-13 16:20:48.130234
2008-04-13 16:21:00	2008-04-22 23:24:27.941262

2009-07-06 19:11:00	2009-07-07 00:05:29.434784
2009-07-07 00:05:30	2009-07-11 10:38:50.779064
2009-07-11 10:39:00	2009-07-14 00:04:44.19723
95 rows fetched in 0.0028s (3.5938s)

Function Scan on fn_pgspans  (cost=0.00..260.00 rows=1000 width=16)

Why so?

If we look closer into the plan, we will see three sorts of same resultset (q) on the same field (s_rn)

Two of them are required to make MERGE JOINs, the third one to calculate the LAG(s_maxend) OVER (s_rn).

It's not clear why a MERGE JOIN is chosen to join there resutlsets.

On one hand, it's the best method for joining two presorted resultsets, on the other hand, the resultsets are being sorted three times anyway.

Sorting resultset once on s_rn and materializing the sorted results would make these MERGE JOINs much more efficient. And as a bonus, the conditionless MAX(s_rn) used in a UNION query could be retrieved just by taking the last result from the resultset.

However, 17 seconds is really very fast, almost as fast as the function-based solution in the previous version.

The current version's function-based solution runs for only 3.6 seconds and beats the former champion, MySQL, hands down. MySQL solution took 11 seconds.

Having window functions, working right as they should, is super nice.

A great work of PostgreSQL team that new version is.

To be continued.

Written by Quassnoi

July 14th, 2009 at 11:00 pm

Posted in PostgreSQL

One Response to 'Flattening timespans: PostgreSQL 8.4'

Subscribe to comments with RSS

[…] https://explainextended.com/2009/07/14/flattening-timespans-postgresql-8-4/ […]

Flattening timespans: PostgreSQL 8.4 - DbRunas

15 Jul 09 at 06:56

EXPLAIN EXTENDED