EXPLAIN EXTENDED

How to create fast database queries

Flattening timespans: Oracle

Comments enabled. I *really* need your comment

In my previous article:

I described a function that merges intersecting timespanss into one contiguous range in SQL Server.

This query turned out to be a good illustration for window functions, so now we'll try to repeat this query in Oracle which is superb in terms of window functions support.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

As you may remember, we had to use the cursors in the SQL Server decision, since SQL Server lacks a way to calculate a running maximum.

But Oracle does have such a way, so we don't need cursors and functions here, and everything can be done in a single query.

The principle remains the same and is just changed a little to fit into one query:

  1. Assign a ROW_NUMBER() to each row (since the rows lack a PRIMARY KEY, we will need it later to use as one).
  2. Order the rows by start and calculate running maximum for the stop.
  3. Pick all rows that have a start greater than the running maximum of the previous row. Having such a timespan means that all timespans evaluated before had started earlier than the given one (since they are ordered by start) and also had ended earlier than the given one have started (since its start exceeds the running maximum). Therefore, we have a gap.
  4. Pick the ROW_NUMBER() of each gap row along with the ROW_NUMBER() of the record that comes right before the gap row next to the given one.
  5. Use these ROW_NUMBER()'s to find out the start and the stop of the corresponding records. These will be the timespan bounds.

Let's create a sample table and compose the query:

CREATE TABLE t_span (
        s_start DATE NOT NULL,
        s_end DATE NOT NULL
        )
/
BEGIN
        DBMS_RANDOM.seed(20090612);
END;
/
INSERT
INTO    t_span
SELECT  TO_DATE('12.06.2009', 'dd.mm.yyyy') - level / 2880,
        TO_DATE('12.06.2009', 'dd.mm.yyyy') - (level - ((DBMS_RANDOM.value() * 10) + 0.5)) / 2880
FROM    dual
CONNECT BY
        level <= 1440000
/
COMMIT
/
&#91;/sourcecode&#93;

Again, this table has <strong>1,440,000</strong> rows with timespans starting every <strong>30</strong> seconds and having random length from <strong>15</strong> to <strong>315</strong> seconds.

Now, let's do the steps described above:


WITH    q AS
        (
        SELECT  s_start,
                s_end,
                MAX(s_end) OVER (ORDER BY s_start) AS s_maxend,
                ROW_NUMBER() OVER (ORDER BY s_start) AS s_rn
        FROM    t_span s
        ORDER BY
                s_start
        )

This will select all rows along with ROW_NUMBER() (aliased as s_rn) and a running maximum (aliases as s_maxend). The running maximum is easily selected using a window function MAX(s_end) OVER (ORDER BY s_start), which selects the maximum value of s_end of all records processed so far in s_start order.

WITH    q AS
        (
        SELECT  s_start,
                s_end,
                MAX(s_end) OVER (ORDER BY s_start) AS s_maxend,
                ROW_NUMBER() OVER (ORDER BY s_start) AS s_rn
        FROM    t_span s
        ORDER BY
                s_start
        )
SELECT  s_rn
FROM    (
        SELECT  q.*, LAG(s_maxend) OVER (ORDER BY s_start) AS s_maxpend
        FROM    q
        )
WHERE   s_start > s_maxpend
UNION ALL
SELECT  MAX(s_rn) + 1
FROM    q

S_RN
40200
46672
63012
73607
81750
103166
118775
142771
144124
152977
1437767
1437768
1440001
73 rows fetched in 0.0015s (21.0311s)

This selects the ROW_NUMBER()'s of all gap rows along with the latest ROW_NUMBER(). Since the number of resulting ranges is always one more than the number of gaps (1 gap results in two ranges, 2 gaps result in three ranges etc.), we need this to find out the upper bound of the final range (which of course will always be the stop of the last timespan).

WITH    q AS
        (
        SELECT  s_start,
                s_end,
                MAX(s_end) OVER (ORDER BY s_start) AS s_maxend,
                ROW_NUMBER() OVER (ORDER BY s_start) AS s_rn
        FROM    t_span s
        ORDER BY
                s_start
        )
SELECT  NVL(LAG(s_rn) OVER (ORDER BY s_rn), 1) AS s_prn, s_rn - 1 AS s_crn
FROM    (
        SELECT  s_rn
        FROM    (
                SELECT  q.*, LAG(s_maxend) OVER (ORDER BY s_start) AS s_maxpend
                FROM    q
                )
        WHERE   s_start > s_maxpend
        UNION ALL
        SELECT  MAX(s_rn) + 1
        FROM    q
        )

S_PRN S_CRN
1 40199
40200 46671
46672 63011
63012 73606
73607 81749
81750 103165
103166 118774
118775 142770
142771 144123
144124 152976
1411348 1437766
1437767 1437767
1437768 1440000
73 rows fetched in 0.0020s (21.6405s)

We make a range resultset from the gap resultset. This is easily done using Oracle's LAG function which selects the value of a given column from the previous row.

Now all that is left to do is find the actual values of the start and stop times from the range row numbers. This is done using two simple JOIN's:

WITH    q AS
        (
        SELECT  s_start,
                s_end,
                MAX(s_end) OVER (ORDER BY s_start) AS s_maxend,
                ROW_NUMBER() OVER (ORDER BY s_start) AS s_rn
        FROM    t_span s
        ORDER BY
                s_start
        )
SELECT  q1.s_start,
        q2.s_end
FROM    (
        SELECT  NVL(LAG(s_rn) OVER (ORDER BY s_rn), 1) AS s_prn, s_rn - 1 AS s_crn
        FROM    (
                SELECT  s_rn
                FROM    (
                        SELECT  q.*, LAG(s_maxend) OVER (ORDER BY s_start) AS s_maxpend
                        FROM    q
                        )
                WHERE   s_start > s_maxpend
                UNION ALL
                SELECT  MAX(s_rn) + 1
                FROM    q
                )
        ) g
JOIN    q q1
ON      q1.s_rn = g.s_prn
JOIN    q q2
ON      q2.s_rn = g.s_crn
ORDER BY
        q1.s_start
S_START S_END
29.01.2008 00:00:00 11.02.2008 22:59:24
11.02.2008 22:59:30 14.02.2008 04:55:16
14.02.2008 04:55:30 19.02.2008 21:05:27
19.02.2008 21:05:30 23.02.2008 13:22:57
23.02.2008 13:23:00 26.02.2008 09:14:22
26.02.2008 09:14:30 04.03.2008 19:42:17
04.03.2008 19:42:30 10.03.2008 05:46:45
10.03.2008 05:47:00 18.03.2008 13:44:48
18.03.2008 13:45:00 19.03.2008 01:01:17
19.03.2008 01:01:30 22.03.2008 02:47:52
02.06.2009 01:13:30 11.06.2009 05:22:46
11.06.2009 05:23:00 11.06.2009 05:23:27
11.06.2009 05:23:30 12.06.2009 00:03:14
73 rows fetched in 0.0022s (23.4217s)
SELECT STATEMENT 
 TEMP TABLE TRANSFORMATION 
  LOAD AS SELECT 
   WINDOW SORT
    TABLE ACCESS FULL, 20090612_timespans.T_SPAN
  SORT ORDER BY
   HASH JOIN 
    VIEW 
     TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6616_A802CC
    HASH JOIN 
     VIEW 
      TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6616_A802CC
     VIEW 
      WINDOW SORT
       VIEW 
        UNION-ALL 
         VIEW 
          WINDOW SORT
           VIEW 
            TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6616_A802CC
         SORT AGGREGATE
          VIEW 
           TABLE ACCESS FULL, SYS.SYS_TEMP_0FD9D6616_A802CC

This is exactly what we were after, and even slightly faster than with SQL Server.

Written by Quassnoi

June 12th, 2009 at 11:00 pm

Posted in Oracle

Leave a Reply