EXPLAIN EXTENDED

How to create fast database queries

Overlapping ranges: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

This query needs to find all events that overlap the week.

The condition for overlapping ranges is well known and quite simple:

start_date < @week_end AND end_date > @week_start

However, in SQL more simple doesn't always mean more efficient, and the following condition, despite being more complex, can yield better performance:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

Let's create a sample table and see how so:

CREATE SCHEMA [20090630_events]
CREATE TABLE t_event (
        id INT NOT NULL PRIMARY KEY,
        evt_start DATETIME NOT NULL,
        evt_end DATETIME NOT NULL,
        evt_data VARCHAR(300) NOT NULL
)
GO
BEGIN TRANSACTION
SELECT  RAND(20090630)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        INSERT
        INTO    [20090630_events].t_event
        VALUES  (
                @cnt,
                DATEADD(minute, -@cnt, '2009-30-06'),
                DATEADD(day, FLOOR(RAND() * 35) + 1, DATEADD(minute, -@cnt, '2009-30-06')),
                LEFT('Value ' + CAST(@cnt AS VARCHAR) + ' ' + REPLICATE('-', 300), 300)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
CREATE INDEX ix_event_start ON [20090630_events].t_event (evt_start)
CREATE INDEX ix_event_end__start ON [20090630_events].t_event (evt_end) INCLUDE (evt_start)
GO

This table contains 1,000,000 rows with events starting every minute and being 1 to 35 days long.

Both evt_start and evt_end are indexed, the index on latter also having evt_start as an included column.

Now, let's query the table using the first condition:

DECLARE @week_start DATETIME
DECLARE @week_end DATETIME
SET @week_start = '2009 Jun 30';
SET @week_end = DATEADD(week, 1, @week_start)
SELECT  SUM(LEN(evt_data))
FROM    [20090630_events].t_event e
WHERE   evt_end >= @week_start
        AND evt_start <= @week_end

7766400
1 row fetched in 0.0001s (5.1874s)
Table 't_event'. Scan count 3, logical reads 47892, physical reads 76, read-ahead reads 24350, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 282 ms,  elapsed time = 5175 ms. 

This query uses the simple condition and runs for then 5 seconds.

If we look into the plan, we'll see the following:

SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Compute Scalar
            Clustered Index Scan ([20090630_events].[t_event])

In other words, neither of the indexes is used: SQL Server thinks that full table scan will be better in this case, since no single index is selective enough here and there is no efficient way to combine the indexes.

Now, let's use the more complex condition:

DECLARE @week_start DATETIME
DECLARE @week_end DATETIME
SET @week_start = '2009 Jun 30';
SET @week_end = DATEADD(week, 1, @week_start)
SELECT  SUM(LEN(evt_data))
FROM    [20090630_events].t_event e
WHERE   (evt_start > @week_start AND evt_start <= @week_end)
        OR (@week_start BETWEEN evt_start AND evt_end)
7766400
1 row fetched in 0.0002s (0.2547s)
Table 't_event'. Scan count 2, logical reads 79372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 156 ms,  elapsed time = 254 ms. 

This query is much more efficient: it completes in but 0.25 seconds, almost 20 times as fast as the original one.

Why such a difference?

Let's see the execution plan:

SELECT
  Compute Scalar
    Stream Aggregate
      Compute Scalar
        Nested Loops (Inner Join)
          Hash Match (Aggregate)
            Concatenation
              Index Seek ([20090630_events].[t_event].[ix_event_start])
              Index Seek ([20090630_events].[t_event].[ix_event_end__start])
          Key Lookup ([20090630_events].[t_event])

We see that both indexes are used.

The OR condition, which is in general cannot be optimized well by database systems, has one advantage in this case: the query above can be rewritten using UNION:

DECLARE @week_start DATETIME
DECLARE @week_end DATETIME
SET @week_start = '2009 Jun 30';
SET @week_end = DATEADD(week, 1, @week_start)

SELECT  SUM(LEN(e.evt_data))
FROM    (
        SELECT  id
        FROM    [20090630_events].t_event
        WHERE   evt_start > @week_start AND evt_start <= @week_end
        UNION ALL
        SELECT  id
        FROM    [20090630_events].t_event
        WHERE   @week_start BETWEEN evt_start AND evt_end
        ) eo
JOIN    [20090630_events].t_event e
ON      e.id = eo.id
OPTION  (LOOP JOIN)

7766400
1 row fetched in 0.0002s (0.0951s)
Table 't_event'. Scan count 5, logical reads 79390, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 109 ms,  elapsed time = 95 ms. 

Rewriting as a UNION separates two OR conditions and lets more efficient index be used for each part.

The rewritten query uses essentially same plan as the original one:

SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Compute Scalar
            Nested Loops (Inner Join)
              Sort
                Concatenation
                  Index Seek ([20090630_events].[t_event].[ix_event_start])
                  Index Seek ([20090630_events].[t_event].[ix_event_end__start])
              Clustered Index Seek ([20090630_events].[t_event])

In fact, with the current values of @week_start and @week_end, it's even more efficient than our original query (since we can safely use UNION ALL here).

However, this syntax robs the optimizer of an opportunity to revert to Clustered Table Scan should the values change, and therefore has inferior worst case performance.

SQL Server's optimizer is smart enough to see that a logical disjunction of two conditions can be represented as a UNION of these conditions and each of the conditions can be efficiently searched separately using the appropriate index.

Written by Quassnoi

June 30th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply