Overlapping ranges: SQL Server
Comments enabled. I *really* need your comment
From Stack Overflow:
I have an event table that specifies a date range with
start_date
andend_date
fields.I have another date range, specified in code, that defines the current week as
week_start
andweek_end
.I'd like to query all events for the week.
This query needs to find all events that overlap the week.
The condition for overlapping ranges is well known and quite simple:
start_date < @week_end AND end_date > @week_start
However, in SQL more simple
doesn't always mean more efficient
, and the following condition, despite being more complex, can yield better performance:
(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
Let's create a sample table and see how so:
CREATE SCHEMA [20090630_events] CREATE TABLE t_event ( id INT NOT NULL PRIMARY KEY, evt_start DATETIME NOT NULL, evt_end DATETIME NOT NULL, evt_data VARCHAR(300) NOT NULL ) GO BEGIN TRANSACTION SELECT RAND(20090630) DECLARE @cnt INT SET @cnt = 1 WHILE @cnt <= 1000000 BEGIN INSERT INTO [20090630_events].t_event VALUES ( @cnt, DATEADD(minute, -@cnt, '2009-30-06'), DATEADD(day, FLOOR(RAND() * 35) + 1, DATEADD(minute, -@cnt, '2009-30-06')), LEFT('Value ' + CAST(@cnt AS VARCHAR) + ' ' + REPLICATE('-', 300), 300) ) SET @cnt = @cnt + 1 END COMMIT GO CREATE INDEX ix_event_start ON [20090630_events].t_event (evt_start) CREATE INDEX ix_event_end__start ON [20090630_events].t_event (evt_end) INCLUDE (evt_start) GO
This table contains 1,000,000 rows with events starting every minute and being 1 to 35 days long.
Both evt_start
and evt_end
are indexed, the index on latter also having evt_start
as an included column.
Now, let's query the table using the first condition:
DECLARE @week_start DATETIME DECLARE @week_end DATETIME SET @week_start = '2009 Jun 30'; SET @week_end = DATEADD(week, 1, @week_start) SELECT SUM(LEN(evt_data)) FROM [20090630_events].t_event e WHERE evt_end >= @week_start AND evt_start <= @week_end
7766400 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (5.1874s) |
Table 't_event'. Scan count 3, logical reads 47892, physical reads 76, read-ahead reads 24350, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 282 ms, elapsed time = 5175 ms.
This query uses the simple condition and runs for then 5 seconds.
If we look into the plan, we'll see the following:
SELECT Compute Scalar Stream Aggregate Parallelism (Gather Streams) Stream Aggregate Compute Scalar Clustered Index Scan ([20090630_events].[t_event])
In other words, neither of the indexes is used: SQL Server thinks that full table scan will be better in this case, since no single index is selective enough here and there is no efficient way to combine the indexes.
Now, let's use the more complex condition:
DECLARE @week_start DATETIME DECLARE @week_end DATETIME SET @week_start = '2009 Jun 30'; SET @week_end = DATEADD(week, 1, @week_start) SELECT SUM(LEN(evt_data)) FROM [20090630_events].t_event e WHERE (evt_start > @week_start AND evt_start <= @week_end) OR (@week_start BETWEEN evt_start AND evt_end)
7766400 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.2547s) |
Table 't_event'. Scan count 2, logical reads 79372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 156 ms, elapsed time = 254 ms.
This query is much more efficient: it completes in but 0.25 seconds, almost 20 times as fast as the original one.
Why such a difference?
Let's see the execution plan:
SELECT Compute Scalar Stream Aggregate Compute Scalar Nested Loops (Inner Join) Hash Match (Aggregate) Concatenation Index Seek ([20090630_events].[t_event].[ix_event_start]) Index Seek ([20090630_events].[t_event].[ix_event_end__start]) Key Lookup ([20090630_events].[t_event])
We see that both indexes are used.
The OR
condition, which is in general cannot be optimized well by database systems, has one advantage in this case: the query above can be rewritten using UNION
:
DECLARE @week_start DATETIME DECLARE @week_end DATETIME SET @week_start = '2009 Jun 30'; SET @week_end = DATEADD(week, 1, @week_start) SELECT SUM(LEN(e.evt_data)) FROM ( SELECT id FROM [20090630_events].t_event WHERE evt_start > @week_start AND evt_start <= @week_end UNION ALL SELECT id FROM [20090630_events].t_event WHERE @week_start BETWEEN evt_start AND evt_end ) eo JOIN [20090630_events].t_event e ON e.id = eo.id OPTION (LOOP JOIN)
7766400 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0951s) |
Table 't_event'. Scan count 5, logical reads 79390, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 109 ms, elapsed time = 95 ms.
Rewriting as a UNION
separates two OR
conditions and lets more efficient index be used for each part.
The rewritten query uses essentially same plan as the original one:
SELECT Compute Scalar Stream Aggregate Parallelism (Gather Streams) Stream Aggregate Compute Scalar Nested Loops (Inner Join) Sort Concatenation Index Seek ([20090630_events].[t_event].[ix_event_start]) Index Seek ([20090630_events].[t_event].[ix_event_end__start]) Clustered Index Seek ([20090630_events].[t_event])
In fact, with the current values of @week_start
and @week_end
, it's even more efficient than our original query (since we can safely use UNION ALL
here).
However, this syntax robs the optimizer of an opportunity to revert to Clustered Table Scan
should the values change, and therefore has inferior worst case performance.
SQL Server's optimizer is smart enough to see that a logical disjunction of two conditions can be represented as a UNION
of these conditions and each of the conditions can be efficiently searched separately using the appropriate index.