Flattening timespans: PostgreSQL 8.4
On Jul 1, 2009, PostgreSQL 8.4 was released.
Among other imporvements, this version supports window functions, recursive queries and common table expressions (CTE's).
Despite being a minor release (accoring to the version numeration), this version can become quite a milestone, since these features make developer's life much, much easier.
Let's check how efficienly these features are implemented.
To do this, I'll take some tasks that I wrote about in the previous blog posts and try to reimplement them using new PostgreSQL features.
I'll start with quite a common task of flattening the intersecting timespans which I wrote about in this article:
This task requires calculating a running maximum and taking a previous record from a recordset, and therefore is a good illustration for window functions.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
Let's create a sample table:
CREATE TABLE t_span ( s_start TIMESTAMP NOT NULL, s_end TIMESTAMP NOT NULL ); CREATE INDEX ix_span_start ON t_span (s_start); SELECT setseed(0.20090714); INSERT INTO t_span SELECT CAST('14.07.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval, CAST('14.07.2009' AS TIMESTAMP) - ((s * 30) || ' SECOND')::interval + ((15 + RANDOM() * 300) || ' SECOND')::interval FROM generate_series (1, 1440000) s; ANALYZE t_span;
To solve this task in PostgreSQL 8.4, I'll take a query that I previously wrote to do the same thing in Oracle:
This query runs almost out of the box, with two slight changes: I just had to replace NVL
with COALESCE
and alias the subqueries.
Since both these things are required by ANSI standard, I think it's acceptable.
Here's the query:
WITH q AS ( SELECT s_start, s_end, MAX(s_end) OVER (ORDER BY s_start) AS s_maxend, ROW_NUMBER() OVER (ORDER BY s_start) AS s_rn FROM t_span s ORDER BY s_start ) SELECT q1.s_start, q2.s_end FROM ( SELECT COALESCE(LAG(s_rn) OVER (ORDER BY s_rn), 1) AS s_prn, s_rn - 1 AS s_crn FROM ( SELECT s_rn FROM ( SELECT q.*, LAG(s_maxend) OVER (ORDER BY s_rn) AS s_maxpend FROM q ) qo WHERE s_start > s_maxpend UNION ALL SELECT MAX(s_rn) + 1 FROM q ) qo ) g JOIN q q1 ON q1.s_rn = g.s_prn JOIN q q2 ON q2.s_rn = g.s_crn ORDER BY q1.s_start
s_start | s_end | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008-03-01 00:00:00 | 2008-03-02 04:56:49.659754 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-02 04:57:00 | 2008-03-07 12:29:49.143806 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-07 12:30:00 | 2008-03-08 20:56:28.941583 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-08 20:56:30 | 2008-03-11 21:20:58.638719 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-11 21:21:00 | 2008-03-20 13:51:15.281918 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-20 13:51:30 | 2008-03-30 15:54:59.841784 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-30 15:55:00 | 2008-04-02 13:53:45.933431 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-04-02 13:54:00 | 2008-04-13 16:20:48.130234 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-04-13 16:21:00 | 2008-04-22 23:24:19.521762 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-07-06 19:11:00 | 2009-07-07 00:05:16.17393 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-07-07 00:05:30 | 2009-07-11 10:38:45.594804 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-07-11 10:39:00 | 2009-07-14 00:04:44.19723 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
95 rows fetched in 0.0030s (17.6874s) |
Sort (cost=11013843991026.27..11076052120626.27 rows=24883251840000 width=16) Sort Key: q1.s_start CTE q -> WindowAgg (cost=0.00..81387.52 rows=1440000 width=16) -> Index Scan using ix_span_start on t_span s (cost=0.00..56187.52 rows=1440000 width=16) -> Merge Join (cost=1013778667.64..374279850703.64 rows=24883251840000 width=16) Merge Cond: (g.s_prn = q1.s_rn) -> Sort (cost=1013553348.65..1022193366.65 rows=3456007200 width=16) Sort Key: g.s_prn -> Merge Join (cost=719705.08..52571813.09 rows=3456007200 width=16) Merge Cond: (q2.s_rn = g.s_crn) -> Sort (cost=225318.99..228918.99 rows=1440000 width=16) Sort Key: q2.s_rn -> CTE Scan on q q2 (cost=0.00..28800.00 rows=1440000 width=16) -> Materialize (cost=494386.09..500386.10 rows=480001 width=16) -> Sort (cost=494386.09..495586.09 rows=480001 width=16) Sort Key: g.s_crn -> Subquery Scan g (cost=418283.54..432683.57 rows=480001 width=16) -> WindowAgg (cost=418283.54..427883.56 rows=480001 width=8) -> Sort (cost=418283.54..419483.54 rows=480001 width=8) Sort Key: qo.s_rn -> Result (cost=279456.99..359857.01 rows=480001 width=8) -> Append (cost=279456.99..359857.01 rows=480001 width=8) -> Subquery Scan qo (cost=279456.99..322656.99 rows=480000 width=8) Filter: (qo.s_start > qo.s_maxpend) -> WindowAgg (cost=279456.99..304656.99 rows=1440000 width=32) -> Sort (cost=279456.99..283056.99 rows=1440000 width=32) Sort Key: q.s_rn -> CTE Scan on q (cost=0.00..28800.00 rows=1440000 width=32) -> Aggregate (cost=32400.00..32400.01 rows=1 width=8) -> CTE Scan on q (cost=0.00..28800.00 rows=1440000 width=8) -> Materialize (cost=225318.99..243318.99 rows=1440000 width=16) -> Sort (cost=225318.99..228918.99 rows=1440000 width=16) Sort Key: q1.s_rn -> CTE Scan on q q1 (cost=0.00..28800.00 rows=1440000 width=16)
The query from above works (which is nice), works quite fast (which is also nice) but still performs more poorly than a pl/pgSQL solution using cursors. The pl/pgSQL solution takes but 3.6 seconds:
CREATE OR REPLACE FUNCTION fn_pgspans (f_from TIMESTAMP, f_till TIMESTAMP) RETURNS SETOF t_span AS $$ DECLARE r_start TIMESTAMP; r_end TIMESTAMP; c_start TIMESTAMP; c_end TIMESTAMP; r t_span%ROWTYPE; BEGIN FOR c_start, c_end IN SELECT s_start, s_end FROM t_span WHERE s_start BETWEEN f_from AND f_till ORDER BY s_start LOOP IF c_start > r_end OR r_start IS NULL THEN IF c_start > r_end THEN r := (r_start, r_end); RETURN NEXT r; END IF; r_start := c_start; END IF; r_end := GREATEST(c_end, r_end); END LOOP; IF r_end IS NOT NULL THEN r := (r_start, r_end); RETURN NEXT r; END IF; END; $$ LANGUAGE 'plpgsql';
SELECT * FROM fn_pgspans('01.01.2007', '01.01.2011');
s_start | s_end | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008-03-01 00:00:00 | 2008-03-02 04:56:56.228301 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-02 04:57:00 | 2008-03-07 12:29:49.143806 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-07 12:30:00 | 2008-03-08 20:56:28.941583 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-08 20:56:30 | 2008-03-11 21:20:58.638719 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-11 21:21:00 | 2008-03-20 13:51:19.202829 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-20 13:51:30 | 2008-03-30 15:54:59.841784 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-03-30 15:55:00 | 2008-04-02 13:53:53.391648 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-04-02 13:54:00 | 2008-04-13 16:20:48.130234 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-04-13 16:21:00 | 2008-04-22 23:24:27.941262 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-07-06 19:11:00 | 2009-07-07 00:05:29.434784 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-07-07 00:05:30 | 2009-07-11 10:38:50.779064 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2009-07-11 10:39:00 | 2009-07-14 00:04:44.19723 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
95 rows fetched in 0.0028s (3.5938s) |
Function Scan on fn_pgspans (cost=0.00..260.00 rows=1000 width=16)
Why so?
If we look closer into the plan, we will see three sorts of same resultset (q
) on the same field (s_rn
)
Two of them are required to make MERGE JOIN
s, the third one to calculate the LAG(s_maxend) OVER (s_rn)
.
It's not clear why a MERGE JOIN
is chosen to join there resutlsets.
On one hand, it's the best method for joining two presorted resultsets, on the other hand, the resultsets are being sorted three times anyway.
Sorting resultset once on s_rn
and materializing the sorted results would make these MERGE JOIN
s much more efficient. And as a bonus, the conditionless MAX(s_rn)
used in a UNION
query could be retrieved just by taking the last result from the resultset.
However, 17 seconds is really very fast, almost as fast as the function-based solution in the previous version.
The current version's function-based solution runs for only 3.6 seconds and beats the former champion, MySQL, hands down. MySQL solution took 11 seconds.
Having window functions, working right as they should, is super nice.
A great work of PostgreSQL team that new version is.
To be continued.
[…] https://explainextended.com/2009/07/14/flattening-timespans-postgresql-8-4/ […]
Flattening timespans: PostgreSQL 8.4 - DbRunas
15 Jul 09 at 06:56