Flattening timespans: PostgreSQL at EXPLAIN EXTENDED

Flattening timespans: PostgreSQL

Comments enabled. I *really* need your comment

In previous three articles:

, I described various approaches to merging a set of intersecting timespans into a smaller set of larger, non-intersecting ranges.

Today, I'll describe how to do it in PostgreSQL.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop

2009.06.03 05:54:48:000 10:00:13:000

2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop

2009.06.03 10:12:50:000 10:27:25:000

2009.06.03 10:13:12:000 11:14:56:000

2009.06.03 10:27:25:000 10:27:31:000

2009.06.03 10:27:39:000 13:53:38:000

2009.06.03 11:14:56:000 11:15:03:000

2009.06.03 11:15:30:000 14:02:14:000

2009.06.03 13:53:38:000 13:53:43:000

2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

date	start	stop
2009.06.03	05:54:48:000	10:00:13:000
2009.06.03	09:26:45:000	09:59:40:000

date	start	stop
2009.06.03	10:12:50:000	10:27:25:000
2009.06.03	10:13:12:000	11:14:56:000
2009.06.03	10:27:25:000	10:27:31:000
2009.06.03	10:27:39:000	13:53:38:000
2009.06.03	11:14:56:000	11:15:03:000
2009.06.03	11:15:30:000	14:02:14:000
2009.06.03	13:53:38:000	13:53:43:000
2009.06.03	14:02:14:000	14:02:31:000

PostgreSQL implements various procedural languages, like pl/PGSQL, pl/perl etc., which can be used to solve almost any task using procedural approaches.

Using, for instance, pl/PGSQL, we can implement an approach similar to one we used for SQL Server.

We should create a set-returning function, open a cursor inside it and select all timespans ordered by start, keeping track of a running maximum of stop.

Whenever we find a timespan that starts later than end all previously started timespans, we know we have a gap.

Here's the sample table:

CREATE TABLE t_span
        (
        s_start TIMESTAMP NOT NULL,
        s_stop TIMESTAMP NOT NULL
        );

SELECT setseed(20090614);

INSERT
INTO    t_span
SELECT  CAST('14.06.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval,
        CAST('14.06.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval  + ((15 + RANDOM() * 300) || ' SECOND')::interval
FROM    generate_series (1, 1440000) s;

ANALYZE t_span;

This table, as the previous ones, has 1,440,000 timespans starting every 30 seconds and having random lengths from 15 to 315 seconds.

Here's the function:

CREATE OR REPLACE FUNCTION fn_pgspans (f_from TIMESTAMP, f_till TIMESTAMP)
RETURNS SETOF t_span
AS
$$
DECLARE
        r_start TIMESTAMP;
        r_stop TIMESTAMP;
        c_start TIMESTAMP;
        c_stop TIMESTAMP;
        r t_span%ROWTYPE;
BEGIN
        FOR     c_start, c_stop
        IN
        SELECT  s_start, s_stop
        FROM    t_span
        WHERE   s_start BETWEEN f_from AND f_till
        ORDER BY
                s_start
        LOOP
                IF c_start > r_stop OR r_start IS NULL THEN
                        IF c_start > r_stop THEN
                                r := (r_start, r_stop);
                                RETURN NEXT r;
                        END IF;
                        r_start := c_start;
                END IF;
                r_stop := GREATEST(c_stop, r_stop);
        END LOOP;
        IF r_stop IS NOT NULL THEN
                r := (r_start, r_stop);
                RETURN NEXT r;
        END IF;
END;
$$
LANGUAGE 'plpgsql';

and the result of its work:

SELECT  *
FROM    fn_pgspans('01.01.2007', '01.01.2011');

s_start	s_stop
2008-01-31 00:00:00	2008-02-06 15:30:24.877476
2008-02-06 15:30:30	2008-02-09 19:31:48.126689
2008-02-09 19:32:00	2008-02-10 09:16:52.176897
2008-02-10 09:17:00	2008-02-12 04:15:28.368667
2008-02-12 04:15:30	2008-02-12 13:53:57.344896
2008-02-12 13:54:00	2008-02-21 23:03:22.158928
2008-02-21 23:03:30	2008-02-28 20:08:25.323882
2008-02-28 20:08:30	2008-03-03 04:15:52.322877
2008-03-03 04:16:00	2008-03-16 20:44:25.529989
2008-03-16 20:44:30	2008-03-16 20:59:59.279919

2009-05-22 18:14:30	2009-06-03 07:27:57.630282
2009-06-03 07:28:00	2009-06-12 13:17:27.147309
2009-06-12 13:17:30	2009-06-14 00:03:16.929475
92 rows fetched in 0.0026s (14.3595s)

Function Scan on fn_pgspans  (cost=0.00..260.00 rows=1000 width=16)

This is quite fast, almost as fast as MySQL solution.

However, pl/PGSQL is not installed by default which can be a problem on shared hostings and source-distirbuted web applications.

In this case, we can use pure SQL functions.

Since SQL functions in PostgreSQL lack variables, we will use a temporary table to store intermediate results.

We will update the stop column of the last row of the temporary table with the new value of the running maximum of stop each time we fetch a timespan.

If the previous value of the running maximum is less than the current value of start, we will insert this timespan as the new row into the table.

The row which used to be last will now hold the minimal value of start from the previous range and the maximal value of the stop. This is exactly what we need.

Finally, we will just select all rows.

Since the temporary tables have the scope of session, we will need to create three separate functions.

fn_spans(), available from the outside, will create the table and call two other ones
fn_countspans() will be called repeatedly in a query an populate the temp table
fn_retspans() will just return all rows from the temp table

Here are the functions:

CREATE TEMPORARY TABLE tt_span
(
        r_start TIMESTAMP NOT NULL,
        r_stop TIMESTAMP NOT NULL
);

CREATE OR REPLACE FUNCTION fn_countspan (s_start TIMESTAMP, s_stop TIMESTAMP)
RETURNS void
AS
$$
        INSERT
        INTO    tt_span
        SELECT  $1, $2
        WHERE   $1 >
                COALESCE
                (
                (
                SELECT  MAX(r_stop)
                FROM    tt_span
                ), $1 - '1 second'::interval
                );
        UPDATE  tt_span s
        SET     r_stop = GREATEST($2, r_stop)
        WHERE   r_stop =
                (
                SELECT  MAX(r_stop)
                FROM    tt_span
                )
                AND r_start < $1;
$$
LANGUAGE 'sql';

CREATE OR REPLACE FUNCTION fn_retspans()
RETURNS SETOF t_span
AS
$$
        SELECT  r_start, r_stop
        FROM    tt_span
$$
LANGUAGE 'sql';

CREATE OR REPLACE FUNCTION fn_spans (f_from TIMESTAMP, f_till TIMESTAMP)
RETURNS SETOF t_span
AS
$$
        CREATE TEMPORARY TABLE tt_span
        (
                r_start TIMESTAMP NOT NULL,
                r_stop TIMESTAMP NOT NULL
        ) ON COMMIT DROP;
        CREATE INDEX ixt_span_stop ON tt_span (r_stop);
        SELECT  fn_countspan(s_start, s_stop)
        FROM    (
                SELECT  *
                FROM    t_span
                WHERE   s_start BETWEEN $1 AND $2
                ORDER BY        
                        s_start
                ) q;
        SELECT  *
        FROM    fn_retspans();
$$
LANGUAGE 'sql';
&#91;/sourcecode&#93;

In <strong>PostgreSQL</strong>, temporary tables never persist. That's why we have to create an instance of <code>tt_span</code> for the functions to compile for the first time.

They will later be called from <code>fn_spans()</code> which creates an instance of <code>tt_span</code> itself.

And that's how we call the functions:


SELECT  *
FROM    fn_spans('01.01.2007', '01.01.2011');

s_start	s_stop
2008-01-31 00:00:00	2008-02-06 15:30:24.877476
2008-02-06 15:30:30	2008-02-09 19:31:48.126689
2008-02-09 19:32:00	2008-02-10 09:16:52.176897
2008-02-10 09:17:00	2008-02-12 04:15:28.368667
2008-02-12 04:15:30	2008-02-12 13:53:57.344896
2008-02-12 13:54:00	2008-02-21 23:03:22.158928
2008-02-21 23:03:30	2008-02-28 20:08:25.323882
2008-02-28 20:08:30	2008-03-03 04:15:52.322877
2008-03-03 04:16:00	2008-03-16 20:44:25.529989
2008-03-16 20:44:30	2008-03-16 20:59:59.279919

2009-05-22 18:14:30	2009-06-03 07:27:57.630282
2009-06-03 07:28:00	2009-06-12 13:17:27.147309
2009-06-12 13:17:30	2009-06-14 00:03:16.929475
92 rows fetched in 0.0000s (46.5947s)

Function Scan on fn_spans  (cost=0.00..260.00 rows=1000 width=16)

This is of course longer than pl/PGSQL but may be the only way if the latter is not installed.

Written by Quassnoi

June 14th, 2009 at 11:00 pm

Posted in PostgreSQL

EXPLAIN EXTENDED