EXPLAIN EXTENDED

How to create fast database queries

Flattening timespans: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

This is also quite a common task.

The algorithm here is simple:

  1. Order all timespans by start
  2. Initialize a variable which will hold a running maximum of the timespans' stop
  3. Take the timespans one by one, and for each next timespan, check its start against a running maximum of the previous stop's. If the current start is greater than the maximum of previous stop's, it means that all timespans that started earlier have been ended, and we have a gap
  4. Update the running maximum

Unfortunately, SQL Server does not provide a simple way to calculate running maximums. We could use a correlated subquery of course, but it would take forever.

This is one of quite rare cases when cursor in SQL Server work faster than set-based operations.

Let's create a sample table and see how it works:

CREATE SCHEMA [20090611_timespans]
CREATE TABLE t_span (
s_date DATETIME NOT NULL,
s_start DATETIME NOT NULL,
s_stop DATETIME NOT NULL
)
GO
CREATE INDEX ix_span_date ON [20090611_timespans].t_span (s_date)
GO
DECLARE @day INT
DECLARE @cnt FLOAT
SELECT  RAND(20090611)
BEGIN TRANSACTION
SET @day = 499
WHILE @day >= 0
BEGIN
SET @cnt = 0
WHILE @cnt < 2880
        BEGIN
                INSERT
                INTO    [20090611_timespans].t_span
                VALUES  (
                        CAST('2009-06-11' AS DATETIME) - @day,
                        CAST('00:00:00' AS DATETIME) + @cnt / 2880,
                        CAST('00:00:00' AS DATETIME) + (@cnt + (RAND() * 10) + 0.5) / 2880
                        )
                SET @cnt = @cnt + 1
        END
        SET @day = @day - 1
END
COMMIT
GO

This table has 1,440,000 rows, with timespans starting each 30 seconds and having a random length from 15 to 315 seconds.

Dates and times are stored separately, and only the s_date field is indexed. It's not the best design possible, but since the author mentioned it's a legacy table, there's not much that can be done with it.

Now, here's the function:

CREATE FUNCTION [20090611_timespans].fn_spans(@p_from DATETIME, @p_till DATETIME)
RETURNS @t TABLE
(
q_start DATETIME NOT NULL,
q_end DATETIME NOT NULL
)
AS
BEGIN
DECLARE @qs DATETIME
DECLARE @qe DATETIME
DECLARE @ms DATETIME
DECLARE @me DATETIME
DECLARE cr_span CURSOR FAST_FORWARD
FOR
SELECT  s_date + s_start AS q_start,
s_date + s_stop AS q_end
FROM    [20090611_timespans].t_span
WHERE   s_date BETWEEN @p_from - 1 AND @p_till
AND s_date + s_start >= @p_from
AND s_date + s_stop <= @p_till
        ORDER BY
                q_start
        OPEN    cr_span
        FETCH   NEXT
        FROM    cr_span
        INTO    @qs, @qe
        SET @ms = @qs
        SET @me = @qe
        WHILE @@FETCH_STATUS = 0
        BEGIN
                FETCH   NEXT
                FROM    cr_span
                INTO    @qs, @qe
                IF @qs > @me
BEGIN
INSERT
INTO    @t
VALUES (@ms, @me)
SET @ms = @qs
END
SET @me = CASE WHEN @qe > @me THEN @qe ELSE @me END
END
IF @ms IS NOT NULL
BEGIN
INSERT
INTO    @t
VALUES  (@ms, @me)
END
CLOSE   cr_span
RETURN
END
GO

This is quite straightforward. We use the variable @me to store the running maximum and an additional variable @ms to store the beginning of the current contiguous timespan. This will save us a join.

Notice the extra condition in a cursor query: WHERE s_date BETWEEN @p_from - 1 AND @p_till

This may seem to be redundant, but actually this is a coarse filter for the index on s_date to be usable. Additional conditions in WHERE will fine filter the rows.

Here's the result of the query:

SELECT  *
FROM    [20090611_timespans].fn_spans('2007-01-01', '2011-01-01')
q_start q_end
2008-01-29 00:00:00.000 2008-01-30 09:24:28.333
2008-01-30 09:24:30.000 2008-02-05 16:51:48.270
2008-02-05 16:52:00.000 2008-02-09 22:01:57.027
2008-02-09 22:02:00.000 2008-02-11 16:07:20.913
2008-02-11 16:07:30.000 2008-02-16 11:43:58.093
2008-02-16 11:44:00.000 2008-02-17 18:39:52.677
2008-02-17 18:40:00.000 2008-02-28 00:53:57.173
2008-02-28 00:54:00.000 2008-03-03 17:34:59.030
2008-03-03 17:35:00.000 2008-03-04 23:35:50.357
2008-03-04 23:36:00.000 2008-03-10 10:27:25.783
2009-05-30 04:58:00.000 2009-06-01 00:31:51.777
2009-06-01 00:32:00.000 2009-06-08 22:19:57.493
2009-06-08 22:20:00.000 2009-06-12 00:03:03.817
99 rows fetched in 0.0030s (24.4530s)

The query completes in 25 seconds, which is quite an acceptable time for a million and a half rows.

And SQL Server cursors don't seem so terrible after all, being properly used.

Let's try to run this query on a shorter range, like a day, which according to the author's comment is the most popular query condition:

SELECT  *
FROM    [20090611_timespans].fn_spans('2008-01-30', '2008-01-31')
q_start q_end
2008-01-30 00:00:00.000 2008-01-30 09:24:28.333
2008-01-30 09:24:30.000 2008-01-30 23:58:49.763
2 rows fetched in 0.0002s (0.3208s)

This is only 0.32 seconds which is next to instant.

Written by Quassnoi

June 11th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply