Efficient date range query: SQL Server at EXPLAIN EXTENDED

I have a table A that has a startDate and an endDate as 2 DATETIME columns besides some more other columns.

I have another table B that has one DATETIME column (call it dates column).

This is in SQL Server 2005.

How to best set up the indexes etc to get the following:
SELECT *
FROM a, b
WHERE a.startDate >= b.dates
 AND a.endDate < b.dates
&#91;/sourcecode&#93;

Both tables have several thousand records.
</blockquote>

Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.

A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).

However, this query is just the opposite: we search for a constaint between two unknown values.

If we create the indexes on <code>startDate</code> and <code>endDate</code> and issue a query with a condition like this:

<code>startDate <= value AND endDate > value</code>

, this will just give two range condition on two distinct indexes:

<code>-INF <= startDate <= value</code>
<code>value < endDate <= +INF</code>

Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.

There is a special class of indexes: <a href="http://en.wikipedia.org/wiki/R-tree">R-Tree</a> indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called <code>SPATIAL</code> indexes.

In both these systems they can only be defined over geometry data which implies 2-dimensional objects (<code>Point</code>s, <code>LineString</code>s and such).

There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):

<ul>
	<li><a href="/2009/07/01/overlapping-ranges-mysql/">Overlapping ranges: MySQL</a></li>
</ul>

However, SQL Server 2005 does not support SPATIAL indexes.

In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).

However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I'll show how to do it.

Let's create a sample table:

<a href="#" onclick="xcollapse('X9009');return false;">Table creation details</a>
 
<div id="X9009" style="display: none; ">

CREATE SCHEMA [20090811_ranges]
CREATE TABLE t_range (
 rge_id INT NOT NULL PRIMARY KEY,
 rge_from DATETIME NOT NULL,
 rge_till DATETIME NOT NULL,
 rge_stuffing VARCHAR(400) NOT NULL,
 rge_length AS DATEDIFF(day, rge_from, rge_till),
 rge_startdate AS CONVERT(DATETIME, CONVERT(VARCHAR(10), rge_from, 101), 101)
 )
GO
CREATE INDEX IX_range_from ON [20090811_ranges].t_range (rge_from)
CREATE INDEX IX_range_till ON [20090811_ranges].t_range (rge_till)
CREATE INDEX IX_range_length_startdate ON [20090811_ranges].t_range (rge_length, rge_startdate)
GO
SELECT RAND(20090811)
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
 INSERT
 INTO &#91;20090811_ranges&#93;.t_range
 VALUES (
 @cnt,
 DATEADD(minute, -@cnt, CAST('2009-10-08' AS DATETIME)),
 DATEADD(minute, -@cnt, CAST('2009-10-08' AS DATETIME)) + RAND() * 10 + 1,
 LEFT('Range ' + CAST(@cnt AS VARCHAR) + ' ' + REPLICATE('*', 400), 400)
 )
 SET @cnt = @cnt + 1
END
COMMIT
&#91;/sourcecode&#93;
</div>

Table <code>t_range</code> has 1,000,000 rows, with ranges starting each minute and having random length up to 11 days.

There are two computed columns in this table: <code>rge_length</code>, which returns a number of midnights between the start and the end of each range, and <code>rge_startdate</code>, which truncates the start's date to 00:00:00. And there is a composite index on these columns.

Here's the plain query:


DECLARE @date DATETIME
SET @date = '2009-01-01'

SELECT SUM(LEN(rge_stuffing))
FROM [20090811_ranges].t_range
WHERE rge_from <= @date
 AND rge_till > @date
View details
3432400

1 row fetched in 0.0002s (14.6247s)
Table 't_range'. Scan count 3, logical reads 61197, physical reads 161, read-ahead reads 50569, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
 CPU time = 577 ms, elapsed time = 14620 ms. 
SELECT
 Compute Scalar
 Stream Aggregate
 Parallelism (Gather Streams)
 Stream Aggregate
 Compute Scalar
 Clustered Index Scan [test].[20090811_ranges].[t_range]
It's more than 14 seconds.

Now, how can we improve the query?

To do this, we need to transform a range condition on both columns into an equality condition which we could later check.

To do this, I just split the starts of the ranges and their lengths into chunks of one day long.

How do we test if the range contains the given date? Say, Jan 1, 2009?

This date may possbily be contained by:

All ranges having length of 0 (that is, the range starts and ends on same date) and rge_startdate of Jan 1, 2009

All ranges having length of 1 and rge_startdate between Dec 31, 2008 and Jan 1, 2009

All ranges having length of 2 and rge_startdate between Dec 30, 2008 and Jan 1, 2009

… etc

This list is possibly infinite, however, we know the maximal length possible is 12 (or we would split the ranges into longer chunks), so this list is limited by the maximal possible length.

And each of the conditions above is coverable by plain B-Tree indexes.

To make the conditions even more efficient, we will split each of the ranges into subchunks which can be tested with equality conditions (as opposed to range conditions). This will make HASH JOIN method usable (in case the optimizer will decide to use it).

Here's how we do it:

length = 0 AND

rge_startdate = @date

length = 1 AND

rge_startdate = @date

OR rge_startdate = @date - 1

length = 2 AND

rge_startdate = @date

OR rge_startdate = @date - 1

OR rge_startdate = @date - 2

…etc

We can easily constuct the numbers required for the JOIN above using recursive CTEs.

The conditions above, however, are necessary but not sufficient: all ranges containing given @date are guaranteed to satisfy these conditions but there are ranges satisfying the conditions above but not containing the @date.

To fine filter the ranges, we will just add original rge_from <= @date AND rge_till > @date to the WHERE clause of the query.

And here's the query:
DECLARE @date DATETIME
SET @date = '2009-01-01'

;WITH rd AS
 (
 SELECT MAX(rge_length) AS length
 FROM [20090811_ranges].t_range
 UNION ALL
 SELECT length - 1
 FROM rd
 WHERE length > 0
 )
SELECT SUM(LEN(rge_stuffing))
FROM rd AS rd1
JOIN
 rd AS rd2
ON rd2.length <= rd1.length
JOIN &#91;20090811_ranges&#93;.t_range r
ON r.rge_length = rd1.length
 AND r.rge_startdate = CAST(CAST(CAST(@date AS FLOAT) AS INT) AS DATETIME) - rd2.length
 AND r.rge_from <= @date
 AND r.rge_till > @date
View details
3432400

1 row fetched in 0.0002s (0.0556s)
Table 'Worktable'. Scan count 8, logical reads 289, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_range'. Scan count 81, logical reads 31211, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
 CPU time = 94 ms, elapsed time = 55 ms. 
55 milliseconds. Same result instantly.

It should be noted that but a single range having a length of 100 or more will make the CTE fail due to recursion limit.

This can be adjusted using MAXRECURSION option.

In this case, the splitting period should be reconsidered (to be a week or even more, depending on the maximum length).

Share this:
Facebook
X


3432400
1 row fetched in 0.0002s (14.6247s)


3432400
1 row fetched in 0.0002s (0.0556s)

Written by Quassnoi

August 11th, 2009 at 11:00 pm

Posted in SQL Server

One Response to 'Efficient date range query: SQL Server'

Subscribe to comments with RSS

excellent solution, liking the use of spatial too thanks!

Lukek

26 Feb 14 at 17:23

EXPLAIN EXTENDED

Efficient date range query: SQL Server

One Response to 'Efficient date range query: SQL Server'

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow