From Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000
The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000
which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
This is also quite a common task.
The algorithm here is simple:
- Order all timespans by
- Initialize a variable which will hold a running maximum of the timespans'
- Take the timespans one by one, and for each next timespan, check its
startagainst a running maximum of the previous
stop's. If the current
startis greater than the maximum of previous
stop's, it means that all timespans that started earlier have been ended, and we have a gap
- Update the running maximum
SQL Server does not provide a simple way to calculate running maximums. We could use a correlated subquery of course, but it would take forever.
This is one of quite rare cases when cursor in
SQL Server work faster than set-based operations.
Let's create a sample table and see how it works:
CREATE SCHEMA [20090611_timespans] CREATE TABLE t_span ( s_date DATETIME NOT NULL, s_start DATETIME NOT NULL, s_stop DATETIME NOT NULL ) GO CREATE INDEX ix_span_date ON [20090611_timespans].t_span (s_date) GO DECLARE @day INT DECLARE @cnt FLOAT SELECT RAND(20090611) BEGIN TRANSACTION SET @day = 499 WHILE @day >= 0 BEGIN SET @cnt = 0 WHILE @cnt < 2880 BEGIN INSERT INTO [20090611_timespans].t_span VALUES ( CAST('2009-06-11' AS DATETIME) - @day, CAST('00:00:00' AS DATETIME) + @cnt / 2880, CAST('00:00:00' AS DATETIME) + (@cnt + (RAND() * 10) + 0.5) / 2880 ) SET @cnt = @cnt + 1 END SET @day = @day - 1 END COMMIT GO
This table has 1,440,000 rows, with timespans starting each 30 seconds and having a random length from 15 to 315 seconds.
Dates and times are stored separately, and only the
s_date field is indexed. It's not the best design possible, but since the author mentioned it's a legacy table, there's not much that can be done with it.
Now, here's the function:
CREATE FUNCTION [20090611_timespans].fn_spans(@p_from DATETIME, @p_till DATETIME) RETURNS @t TABLE ( q_start DATETIME NOT NULL, q_end DATETIME NOT NULL ) AS BEGIN DECLARE @qs DATETIME DECLARE @qe DATETIME DECLARE @ms DATETIME DECLARE @me DATETIME DECLARE cr_span CURSOR FAST_FORWARD FOR SELECT s_date + s_start AS q_start, s_date + s_stop AS q_end FROM [20090611_timespans].t_span WHERE s_date BETWEEN @p_from - 1 AND @p_till AND s_date + s_start >= @p_from AND s_date + s_stop <= @p_till ORDER BY q_start OPEN cr_span FETCH NEXT FROM cr_span INTO @qs, @qe SET @ms = @qs SET @me = @qe WHILE @@FETCH_STATUS = 0 BEGIN FETCH NEXT FROM cr_span INTO @qs, @qe IF @qs > @me BEGIN INSERT INTO @t VALUES (@ms, @me) SET @ms = @qs END SET @me = CASE WHEN @qe > @me THEN @qe ELSE @me END END IF @ms IS NOT NULL BEGIN INSERT INTO @t VALUES (@ms, @me) END CLOSE cr_span RETURN END GO
This is quite straightforward. We use the variable
@me to store the running maximum and an additional variable
@ms to store the beginning of the current contiguous timespan. This will save us a join.
Notice the extra condition in a cursor query:
WHERE s_date BETWEEN @p_from - 1 AND @p_till
This may seem to be redundant, but actually this is a coarse filter for the index on
s_date to be usable. Additional conditions in
WHERE will fine filter the rows.
Here's the result of the query:
SELECT * FROM [20090611_timespans].fn_spans('2007-01-01', '2011-01-01')
|2008-01-29 00:00:00.000||2008-01-30 09:24:28.333|
|2008-01-30 09:24:30.000||2008-02-05 16:51:48.270|
|2008-02-05 16:52:00.000||2008-02-09 22:01:57.027|
|2008-02-09 22:02:00.000||2008-02-11 16:07:20.913|
|2008-02-11 16:07:30.000||2008-02-16 11:43:58.093|
|2008-02-16 11:44:00.000||2008-02-17 18:39:52.677|
|2008-02-17 18:40:00.000||2008-02-28 00:53:57.173|
|2008-02-28 00:54:00.000||2008-03-03 17:34:59.030|
|2008-03-03 17:35:00.000||2008-03-04 23:35:50.357|
|2008-03-04 23:36:00.000||2008-03-10 10:27:25.783|
|2009-05-30 04:58:00.000||2009-06-01 00:31:51.777|
|2009-06-01 00:32:00.000||2009-06-08 22:19:57.493|
|2009-06-08 22:20:00.000||2009-06-12 00:03:03.817|
|99 rows fetched in 0.0030s (24.4530s)|
The query completes in 25 seconds, which is quite an acceptable time for a million and a half rows.
And SQL Server cursors don't seem so terrible after all, being properly used.
Let's try to run this query on a shorter range, like a day, which according to the author's comment is the most popular query condition:
SELECT * FROM [20090611_timespans].fn_spans('2008-01-30', '2008-01-31')
|2008-01-30 00:00:00.000||2008-01-30 09:24:28.333|
|2008-01-30 09:24:30.000||2008-01-30 23:58:49.763|
|2 rows fetched in 0.0002s (0.3208s)|
This is only 0.32 seconds which is next to instant.