Flattening timespans: PostgreSQL
Comments enabled. I *really* need your comment
In previous three articles:
, I described various approaches to merging a set of intersecting timespans into a smaller set of larger, non-intersecting ranges.
Today, I'll describe how to do it in PostgreSQL.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
PostgreSQL implements various procedural languages, like pl/PGSQL, pl/perl etc., which can be used to solve almost any task using procedural approaches.
Using, for instance, pl/PGSQL, we can implement an approach similar to one we used for SQL Server.
We should create a set-returning function, open a cursor inside it and select all timespans ordered by start
, keeping track of a running maximum of stop
.
Whenever we find a timespan that starts later than end all previously started timespans, we know we have a gap.
Here's the sample table:
CREATE TABLE t_span ( s_start TIMESTAMP NOT NULL, s_stop TIMESTAMP NOT NULL ); SELECT setseed(20090614); INSERT INTO t_span SELECT CAST('14.06.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval, CAST('14.06.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval + ((15 + RANDOM() * 300) || ' SECOND')::interval FROM generate_series (1, 1440000) s; ANALYZE t_span;
This table, as the previous ones, has 1,440,000 timespans starting every 30 seconds and having random lengths from 15 to 315 seconds.
Here's the function:
CREATE OR REPLACE FUNCTION fn_pgspans (f_from TIMESTAMP, f_till TIMESTAMP) RETURNS SETOF t_span AS $$ DECLARE r_start TIMESTAMP; r_stop TIMESTAMP; c_start TIMESTAMP; c_stop TIMESTAMP; r t_span%ROWTYPE; BEGIN FOR c_start, c_stop IN SELECT s_start, s_stop FROM t_span WHERE s_start BETWEEN f_from AND f_till ORDER BY s_start LOOP IF c_start > r_stop OR r_start IS NULL THEN IF c_start > r_stop THEN r := (r_start, r_stop); RETURN NEXT r; END IF; r_start := c_start; END IF; r_stop := GREATEST(c_stop, r_stop); END LOOP; IF r_stop IS NOT NULL THEN r := (r_start, r_stop); RETURN NEXT r; END IF; END; $$ LANGUAGE 'plpgsql';
and the result of its work:
SELECT * FROM fn_pgspans('01.01.2007', '01.01.2011');
s_start | s_stop | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008-01-31 00:00:00 | 2008-02-06 15:30:24.877476 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-06 15:30:30 | 2008-02-09 19:31:48.126689 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-09 19:32:00 | 2008-02-10 09:16:52.176897 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-10 09:17:00 | 2008-02-12 04:15:28.368667 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-12 04:15:30 | 2008-02-12 13:53:57.344896 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-12 13:54:00 | 2008-02-21 23:03:22.158928 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-21 23:03:30 | 2008-02-28 20:08:25.323882 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-28 20:08:30 | 2008-03-03 04:15:52.322877 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-03 04:16:00 | 2008-03-16 20:44:25.529989 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-16 20:44:30 | 2008-03-16 20:59:59.279919 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-05-22 18:14:30 | 2009-06-03 07:27:57.630282 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-06-03 07:28:00 | 2009-06-12 13:17:27.147309 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-06-12 13:17:30 | 2009-06-14 00:03:16.929475 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
92 rows fetched in 0.0026s (14.3595s) |
Function Scan on fn_pgspans (cost=0.00..260.00 rows=1000 width=16)
This is quite fast, almost as fast as MySQL solution.
However, pl/PGSQL is not installed by default which can be a problem on shared hostings and source-distirbuted web applications.
In this case, we can use pure SQL functions.
Since SQL functions in PostgreSQL lack variables, we will use a temporary table to store intermediate results.
We will update the stop
column of the last row of the temporary table with the new value of the running maximum of stop
each time we fetch a timespan.
If the previous value of the running maximum is less than the current value of start
, we will insert this timespan as the new row into the table.
The row which used to be last will now hold the minimal value of start
from the previous range and the maximal value of the stop
. This is exactly what we need.
Finally, we will just select all rows.
Since the temporary tables have the scope of session, we will need to create three separate functions.
fn_spans()
, available from the outside, will create the table and call two other onesfn_countspans()
will be called repeatedly in a query an populate the temp tablefn_retspans()
will just return all rows from the temp table
Here are the functions:
CREATE TEMPORARY TABLE tt_span ( r_start TIMESTAMP NOT NULL, r_stop TIMESTAMP NOT NULL ); CREATE OR REPLACE FUNCTION fn_countspan (s_start TIMESTAMP, s_stop TIMESTAMP) RETURNS void AS $$ INSERT INTO tt_span SELECT $1, $2 WHERE $1 > COALESCE ( ( SELECT MAX(r_stop) FROM tt_span ), $1 - '1 second'::interval ); UPDATE tt_span s SET r_stop = GREATEST($2, r_stop) WHERE r_stop = ( SELECT MAX(r_stop) FROM tt_span ) AND r_start < $1; $$ LANGUAGE 'sql'; CREATE OR REPLACE FUNCTION fn_retspans() RETURNS SETOF t_span AS $$ SELECT r_start, r_stop FROM tt_span $$ LANGUAGE 'sql'; CREATE OR REPLACE FUNCTION fn_spans (f_from TIMESTAMP, f_till TIMESTAMP) RETURNS SETOF t_span AS $$ CREATE TEMPORARY TABLE tt_span ( r_start TIMESTAMP NOT NULL, r_stop TIMESTAMP NOT NULL ) ON COMMIT DROP; CREATE INDEX ixt_span_stop ON tt_span (r_stop); SELECT fn_countspan(s_start, s_stop) FROM ( SELECT * FROM t_span WHERE s_start BETWEEN $1 AND $2 ORDER BY s_start ) q; SELECT * FROM fn_retspans(); $$ LANGUAGE 'sql'; [/sourcecode] In <strong>PostgreSQL</strong>, temporary tables never persist. That's why we have to create an instance of <code>tt_span</code> for the functions to compile for the first time. They will later be called from <code>fn_spans()</code> which creates an instance of <code>tt_span</code> itself. And that's how we call the functions: SELECT * FROM fn_spans('01.01.2007', '01.01.2011');
s_start | s_stop | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008-01-31 00:00:00 | 2008-02-06 15:30:24.877476 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-06 15:30:30 | 2008-02-09 19:31:48.126689 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-09 19:32:00 | 2008-02-10 09:16:52.176897 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-10 09:17:00 | 2008-02-12 04:15:28.368667 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-12 04:15:30 | 2008-02-12 13:53:57.344896 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-12 13:54:00 | 2008-02-21 23:03:22.158928 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-21 23:03:30 | 2008-02-28 20:08:25.323882 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-02-28 20:08:30 | 2008-03-03 04:15:52.322877 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-03 04:16:00 | 2008-03-16 20:44:25.529989 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-16 20:44:30 | 2008-03-16 20:59:59.279919 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-05-22 18:14:30 | 2009-06-03 07:27:57.630282 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-06-03 07:28:00 | 2009-06-12 13:17:27.147309 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-06-12 13:17:30 | 2009-06-14 00:03:16.929475 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
92 rows fetched in 0.0000s (46.5947s) |
Function Scan on fn_spans (cost=0.00..260.00 rows=1000 width=16)
This is of course longer than pl/PGSQL but may be the only way if the latter is not installed.