EXPLAIN EXTENDED

How to create fast database queries

Efficient date range query: SQL Server

with one comment

From Stack Overflow:

I have a table A that has a startDate and an endDate as 2 DATETIME columns besides some more other columns.

I have another table B that has one DATETIME column (call it dates column).

This is in SQL Server 2005.

How to best set up the indexes etc to get the following:

SELECT  *
FROM    a, b
WHERE   a.startDate >= b.dates
        AND a.endDate < b.dates
&#91;/sourcecode&#93;

Both tables have several thousand records.
</blockquote>

Queries of this kind involve searching a table for a range that contains a given constant and plain <strong>B-Tree</strong> indexes are not good for this query.

A <strong>B-Tree</strong> index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).

However, this query is just the opposite: we search for a constaint between two unknown values.

If we create the indexes on <code>startDate</code> and <code>endDate</code> and issue a query with a condition like this:

<code>startDate <= value AND endDate > value</code>

, this will just give two range condition on two distinct indexes:

<code>-INF <= startDate <= value</code>
<code>value < endDate <= +INF</code>

Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.

There is a special class of indexes: <a href="http://en.wikipedia.org/wiki/R-tree"><strong>R-Tree</strong></a> indexes, which handle the conditions as described above just perfectly. They are mostly used to query <strong>2</strong>-dimensional <strong>GIS</strong> data, that's why in <strong>MySQL</strong> they are called <code>SPATIAL</code> indexes.

In both these systems they can only be defined over geometry data which implies <strong>2</strong>-dimensional objects (<code>Point</code>s, <code>LineString</code>s and such).

There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in <strong>MySQL</strong>):

<ul>
	<li><a href="/2009/07/01/overlapping-ranges-mysql/"><strong>Overlapping ranges: MySQL</strong></a></li>
</ul>

However, <strong>SQL Server 2005</strong> does not support <strong>SPATIAL</strong> indexes.

In theory, an <strong>R-Tree</strong> index could be built over <strong>1</strong>-dimensional data just as well, but no database of the major four supports it (as for now).

However, quite a close structure can be implemented using plain <strong>B-Tree</strong> indexes, and in this article I'll show how to do it.

Let's create a sample table:
<span id="more-2601"></span>
<a href="#" onclick="xcollapse('X9009');return false;"><strong>Table creation details</strong></a>
<br />
<div id="X9009" style="display: none; ">

CREATE SCHEMA [20090811_ranges]
CREATE TABLE t_range (
        rge_id INT NOT NULL PRIMARY KEY,
        rge_from DATETIME NOT NULL,
        rge_till DATETIME NOT NULL,
        rge_stuffing VARCHAR(400) NOT NULL,
        rge_length AS DATEDIFF(day, rge_from, rge_till),
        rge_startdate AS CONVERT(DATETIME, CONVERT(VARCHAR(10), rge_from, 101), 101)
        )
GO
CREATE INDEX IX_range_from ON [20090811_ranges].t_range (rge_from)
CREATE INDEX IX_range_till ON [20090811_ranges].t_range (rge_till)
CREATE INDEX IX_range_length_startdate ON [20090811_ranges].t_range (rge_length, rge_startdate)
GO
SELECT  RAND(20090811)
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        INSERT
        INTO    &#91;20090811_ranges&#93;.t_range
        VALUES  (
                @cnt,
                DATEADD(minute, -@cnt, CAST('2009-10-08' AS DATETIME)),
                DATEADD(minute, -@cnt, CAST('2009-10-08' AS DATETIME)) + RAND() * 10 + 1,
                LEFT('Range ' + CAST(@cnt AS VARCHAR) + ' ' + REPLICATE('*', 400), 400)
                )
        SET @cnt = @cnt + 1
END
COMMIT
&#91;/sourcecode&#93;
</div>

Table <code>t_range</code> has <strong>1,000,000</strong> rows, with ranges starting each minute and having random length up to <strong>11</strong> days.

There are two computed columns in this table: <code>rge_length</code>, which returns a number of midnights between the start and the end of each range, and <code>rge_startdate</code>, which truncates the start's date to <strong>00:00:00</strong>. And there is a composite index on these columns.

Here's the plain query:


DECLARE @date DATETIME
SET @date = '2009-01-01'

SELECT  SUM(LEN(rge_stuffing))
FROM    [20090811_ranges].t_range
WHERE   rge_from <= @date
        AND rge_till > @date

View details

It's more than 14 seconds.

Now, how can we improve the query?

To do this, we need to transform a range condition on both columns into an equality condition which we could later check.

To do this, I just split the starts of the ranges and their lengths into chunks of one day long.

How do we test if the range contains the given date? Say, Jan 1, 2009?

This date may possbily be contained by:

  • All ranges having length of 0 (that is, the range starts and ends on same date) and rge_startdate of Jan 1, 2009
  • All ranges having length of 1 and rge_startdate between Dec 31, 2008 and Jan 1, 2009
  • All ranges having length of 2 and rge_startdate between Dec 30, 2008 and Jan 1, 2009
  • … etc

This list is possibly infinite, however, we know the maximal length possible is 12 (or we would split the ranges into longer chunks), so this list is limited by the maximal possible length.

And each of the conditions above is coverable by plain B-Tree indexes.

To make the conditions even more efficient, we will split each of the ranges into subchunks which can be tested with equality conditions (as opposed to range conditions). This will make HASH JOIN method usable (in case the optimizer will decide to use it).

Here's how we do it:

  • length = 0 AND
    • rge_startdate = @date
  • length = 1 AND
    • rge_startdate = @date
    • OR rge_startdate = @date - 1
  • length = 2 AND
    • rge_startdate = @date
    • OR rge_startdate = @date - 1
    • OR rge_startdate = @date - 2
  • …etc

We can easily constuct the numbers required for the JOIN above using recursive CTEs.

The conditions above, however, are necessary but not sufficient: all ranges containing given @date are guaranteed to satisfy these conditions but there are ranges satisfying the conditions above but not containing the @date.

To fine filter the ranges, we will just add original rge_from <= @date AND rge_till > @date to the WHERE clause of the query.

And here's the query:

DECLARE @date DATETIME
SET @date = '2009-01-01'

;WITH rd AS
        (
        SELECT  MAX(rge_length) AS length
        FROM    [20090811_ranges].t_range
        UNION ALL
        SELECT  length - 1
        FROM    rd
        WHERE   length > 0
        )
SELECT  SUM(LEN(rge_stuffing))
FROM    rd AS rd1
JOIN
        rd AS rd2
ON      rd2.length <= rd1.length
JOIN    &#91;20090811_ranges&#93;.t_range r
ON      r.rge_length = rd1.length
        AND r.rge_startdate = CAST(CAST(CAST(@date AS FLOAT) AS INT) AS DATETIME) - rd2.length
        AND r.rge_from <= @date
        AND r.rge_till > @date

View details

55 milliseconds. Same result instantly.

It should be noted that but a single range having a length of 100 or more will make the CTE fail due to recursion limit.

This can be adjusted using MAXRECURSION option.

In this case, the splitting period should be reconsidered (to be a week or even more, depending on the maximum length).

Written by Quassnoi

August 11th, 2009 at 11:00 pm

Posted in SQL Server

One Response to 'Efficient date range query: SQL Server'

Subscribe to comments with RSS

  1. excellent solution, liking the use of spatial too thanks!

    Lukek

    26 Feb 14 at 17:23

Leave a Reply