Efficient date range query: SQL Server
From Stack Overflow:
I have a table
Athat has astartDateand anendDateas 2DATETIMEcolumns besides some more other columns.I have another table
Bthat has oneDATETIMEcolumn (call itdatescolumn).This is in SQL Server 2005.
How to best set up the indexes etc to get the following:
SELECT * FROM a, b WHERE a.startDate >= b.dates AND a.endDate < b.datesBoth tables have several thousand records.
Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.
A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).
However, this query is just the opposite: we search for a constaint between two unknown values.
If we create the indexes on startDate and endDate and issue a query with a condition like this:
startDate <= value AND endDate > value
, this will just give two range condition on two distinct indexes:
-INF <= startDate <= value
value < endDate <= +INF
Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.
There is a special class of indexes: R-Tree indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called SPATIAL indexes.
In both these systems they can only be defined over geometry data which implies 2-dimensional objects (Points, LineStrings and such).
There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):
However, SQL Server 2005 does not support SPATIAL indexes.
In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).
However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I'll show how to do it.
Let's create a sample table:
Table creation details
Table t_range has 1,000,000 rows, with ranges starting each minute and having random length up to 11 days.
There are two computed columns in this table: rge_length, which returns a number of midnights between the start and the end of each range, and rge_startdate, which truncates the start's date to 00:00:00. And there is a composite index on these columns.
Here's the plain query:
DECLARE @date DATETIME
SET @date = '2009-01-01'
SELECT SUM(LEN(rge_stuffing))
FROM [20090811_ranges].t_range
WHERE rge_from <= @date
AND rge_till > @date
It's more than 14 seconds.
Now, how can we improve the query?
To do this, we need to transform a range condition on both columns into an equality condition which we could later check.
To do this, I just split the starts of the ranges and their lengths into chunks of one day long.
How do we test if the range contains the given date? Say, Jan 1, 2009?
This date may possbily be contained by:
- All ranges having length of 0 (that is, the range starts and ends on same date) and
rge_startdateof Jan 1, 2009 - All ranges having length of 1 and
rge_startdatebetween Dec 31, 2008 and Jan 1, 2009 - All ranges having length of 2 and
rge_startdatebetween Dec 30, 2008 and Jan 1, 2009 - … etc
This list is possibly infinite, however, we know the maximal length possible is 12 (or we would split the ranges into longer chunks), so this list is limited by the maximal possible length.
And each of the conditions above is coverable by plain B-Tree indexes.
To make the conditions even more efficient, we will split each of the ranges into subchunks which can be tested with equality conditions (as opposed to range conditions). This will make HASH JOIN method usable (in case the optimizer will decide to use it).
Here's how we do it:
length = 0 ANDrge_startdate = @date
length = 1 ANDrge_startdate = @dateOR rge_startdate = @date - 1
length = 2 ANDrge_startdate = @dateOR rge_startdate = @date - 1OR rge_startdate = @date - 2
- …etc
We can easily constuct the numbers required for the JOIN above using recursive CTEs.
The conditions above, however, are necessary but not sufficient: all ranges containing given @date are guaranteed to satisfy these conditions but there are ranges satisfying the conditions above but not containing the @date.
To fine filter the ranges, we will just add original rge_from <= @date AND rge_till > @date to the WHERE clause of the query.
And here's the query:
DECLARE @date DATETIME
SET @date = '2009-01-01'
;WITH rd AS
(
SELECT MAX(rge_length) AS length
FROM [20090811_ranges].t_range
UNION ALL
SELECT length - 1
FROM rd
WHERE length > 0
)
SELECT SUM(LEN(rge_stuffing))
FROM rd AS rd1
JOIN
rd AS rd2
ON rd2.length <= rd1.length
JOIN [20090811_ranges].t_range r
ON r.rge_length = rd1.length
AND r.rge_startdate = CAST(CAST(CAST(@date AS FLOAT) AS INT) AS DATETIME) - rd2.length
AND r.rge_from <= @date
AND r.rge_till > @date
55 milliseconds. Same result instantly.
It should be noted that but a single range having a length of 100 or more will make the CTE fail due to recursion limit.
This can be adjusted using MAXRECURSION option.
In this case, the splitting period should be reconsidered (to be a week or even more, depending on the maximum length).
Subscribe in a reader