Efficient date range query: SQL Server
From Stack Overflow:
I have a table
A
that has astartDate
and anendDate
as 2DATETIME
columns besides some more other columns.I have another table
B
that has oneDATETIME
column (call itdates
column).This is in SQL Server 2005.
How to best set up the indexes etc to get the following:
SELECT * FROM a, b WHERE a.startDate >= b.dates AND a.endDate < b.dates [/sourcecode] Both tables have several thousand records. </blockquote> Queries of this kind involve searching a table for a range that contains a given constant and plain <strong>B-Tree</strong> indexes are not good for this query. A <strong>B-Tree</strong> index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants). However, this query is just the opposite: we search for a constaint between two unknown values. If we create the indexes on <code>startDate</code> and <code>endDate</code> and issue a query with a condition like this: <code>startDate <= value AND endDate > value</code> , this will just give two range condition on two distinct indexes: <code>-INF <= startDate <= value</code> <code>value < endDate <= +INF</code> Each of these ranges is too large to be efficient for searching, though their intersection can be quite small. There is a special class of indexes: <a href="http://en.wikipedia.org/wiki/R-tree"><strong>R-Tree</strong></a> indexes, which handle the conditions as described above just perfectly. They are mostly used to query <strong>2</strong>-dimensional <strong>GIS</strong> data, that's why in <strong>MySQL</strong> they are called <code>SPATIAL</code> indexes. In both these systems they can only be defined over geometry data which implies <strong>2</strong>-dimensional objects (<code>Point</code>s, <code>LineString</code>s and such). There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in <strong>MySQL</strong>): <ul> <li><a href="/2009/07/01/overlapping-ranges-mysql/"><strong>Overlapping ranges: MySQL</strong></a></li> </ul> However, <strong>SQL Server 2005</strong> does not support <strong>SPATIAL</strong> indexes. In theory, an <strong>R-Tree</strong> index could be built over <strong>1</strong>-dimensional data just as well, but no database of the major four supports it (as for now). However, quite a close structure can be implemented using plain <strong>B-Tree</strong> indexes, and in this article I'll show how to do it. Let's create a sample table: <span id="more-2601"></span> <a href="#" onclick="xcollapse('X9009');return false;"><strong>Table creation details</strong></a> <br /> <div id="X9009" style="display: none; "> CREATE SCHEMA [20090811_ranges] CREATE TABLE t_range ( rge_id INT NOT NULL PRIMARY KEY, rge_from DATETIME NOT NULL, rge_till DATETIME NOT NULL, rge_stuffing VARCHAR(400) NOT NULL, rge_length AS DATEDIFF(day, rge_from, rge_till), rge_startdate AS CONVERT(DATETIME, CONVERT(VARCHAR(10), rge_from, 101), 101) ) GO CREATE INDEX IX_range_from ON [20090811_ranges].t_range (rge_from) CREATE INDEX IX_range_till ON [20090811_ranges].t_range (rge_till) CREATE INDEX IX_range_length_startdate ON [20090811_ranges].t_range (rge_length, rge_startdate) GO SELECT RAND(20090811) BEGIN TRANSACTION DECLARE @cnt INT SET @cnt = 1 WHILE @cnt <= 1000000 BEGIN INSERT INTO [20090811_ranges].t_range VALUES ( @cnt, DATEADD(minute, -@cnt, CAST('2009-10-08' AS DATETIME)), DATEADD(minute, -@cnt, CAST('2009-10-08' AS DATETIME)) + RAND() * 10 + 1, LEFT('Range ' + CAST(@cnt AS VARCHAR) + ' ' + REPLICATE('*', 400), 400) ) SET @cnt = @cnt + 1 END COMMIT [/sourcecode] </div> Table <code>t_range</code> has <strong>1,000,000</strong> rows, with ranges starting each minute and having random length up to <strong>11</strong> days. There are two computed columns in this table: <code>rge_length</code>, which returns a number of midnights between the start and the end of each range, and <code>rge_startdate</code>, which truncates the start's date to <strong>00:00:00</strong>. And there is a composite index on these columns. Here's the plain query: DECLARE @date DATETIME SET @date = '2009-01-01' SELECT SUM(LEN(rge_stuffing)) FROM [20090811_ranges].t_range WHERE rge_from <= @date AND rge_till > @dateIt's more than 14 seconds.
Now, how can we improve the query?
To do this, we need to transform a range condition on both columns into an equality condition which we could later check.
To do this, I just split the starts of the ranges and their lengths into chunks of one day long.
How do we test if the range contains the given date? Say, Jan 1, 2009?
This date may possbily be contained by:
- All ranges having length of 0 (that is, the range starts and ends on same date) and
rge_startdate
of Jan 1, 2009- All ranges having length of 1 and
rge_startdate
between Dec 31, 2008 and Jan 1, 2009- All ranges having length of 2 and
rge_startdate
between Dec 30, 2008 and Jan 1, 2009- … etc
This list is possibly infinite, however, we know the maximal length possible is 12 (or we would split the ranges into longer chunks), so this list is limited by the maximal possible length.
And each of the conditions above is coverable by plain B-Tree indexes.
To make the conditions even more efficient, we will split each of the ranges into subchunks which can be tested with equality conditions (as opposed to range conditions). This will make HASH JOIN method usable (in case the optimizer will decide to use it).
Here's how we do it:
length = 0 AND
rge_startdate = @date
length = 1 AND
rge_startdate = @date
OR rge_startdate = @date - 1
length = 2 AND
rge_startdate = @date
OR rge_startdate = @date - 1
OR rge_startdate = @date - 2
- …etc
We can easily constuct the numbers required for the
JOIN
above using recursive CTEs.The conditions above, however, are necessary but not sufficient: all ranges containing given
@date
are guaranteed to satisfy these conditions but there are ranges satisfying the conditions above but not containing the@date
.To fine filter the ranges, we will just add original
rge_from <= @date AND rge_till > @date
to theWHERE
clause of the query.And here's the query:
DECLARE @date DATETIME SET @date = '2009-01-01' ;WITH rd AS ( SELECT MAX(rge_length) AS length FROM [20090811_ranges].t_range UNION ALL SELECT length - 1 FROM rd WHERE length > 0 ) SELECT SUM(LEN(rge_stuffing)) FROM rd AS rd1 JOIN rd AS rd2 ON rd2.length <= rd1.length JOIN [20090811_ranges].t_range r ON r.rge_length = rd1.length AND r.rge_startdate = CAST(CAST(CAST(@date AS FLOAT) AS INT) AS DATETIME) - rd2.length AND r.rge_from <= @date AND r.rge_till > @date55 milliseconds. Same result instantly.
It should be noted that but a single range having a length of 100 or more will make the CTE fail due to recursion limit.
This can be adjusted using
MAXRECURSION
option.In this case, the splitting period should be reconsidered (to be a week or even more, depending on the maximum length).
excellent solution, liking the use of spatial too thanks!
Lukek
26 Feb 14 at 17:23