EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Shared Plan and Algorithm Network Cache (SPANC). You're welcome to read and comment on it.

Efficient date range query: SQL Server

From Stack Overflow:

I have a table A that has a startDate and an endDate as 2 DATETIME columns besides some more other columns.

I have another table B that has one DATETIME column (call it dates column).

This is in SQL Server 2005.

How to best set up the indexes etc to get the following:

SELECT  *
FROM    a, b
WHERE   a.startDate >= b.dates
        AND a.endDate < b.dates

Both tables have several thousand records.

Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.

A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).

However, this query is just the opposite: we search for a constaint between two unknown values.

If we create the indexes on startDate and endDate and issue a query with a condition like this:

startDate <= value AND endDate > value

, this will just give two range condition on two distinct indexes:

-INF <= startDate <= value
value < endDate <= +INF

Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.

There is a special class of indexes: R-Tree indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called SPATIAL indexes.

In both these systems they can only be defined over geometry data which implies 2-dimensional objects (Points, LineStrings and such).

There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):

However, SQL Server 2005 does not support SPATIAL indexes.

In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).

However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I'll show how to do it.

Let's create a sample table:

Table creation details

Table t_range has 1,000,000 rows, with ranges starting each minute and having random length up to 11 days.

There are two computed columns in this table: rge_length, which returns a number of midnights between the start and the end of each range, and rge_startdate, which truncates the start's date to 00:00:00. And there is a composite index on these columns.

Here's the plain query:

DECLARE @date DATETIME
SET @date = '2009-01-01'

SELECT  SUM(LEN(rge_stuffing))
FROM    [20090811_ranges].t_range
WHERE   rge_from <= @date
        AND rge_till > @date

View details

It's more than 14 seconds.

Now, how can we improve the query?

To do this, we need to transform a range condition on both columns into an equality condition which we could later check.

To do this, I just split the starts of the ranges and their lengths into chunks of one day long.

How do we test if the range contains the given date? Say, Jan 1, 2009?

This date may possbily be contained by:

  • All ranges having length of 0 (that is, the range starts and ends on same date) and rge_startdate of Jan 1, 2009
  • All ranges having length of 1 and rge_startdate between Dec 31, 2008 and Jan 1, 2009
  • All ranges having length of 2 and rge_startdate between Dec 30, 2008 and Jan 1, 2009
  • … etc

This list is possibly infinite, however, we know the maximal length possible is 12 (or we would split the ranges into longer chunks), so this list is limited by the maximal possible length.

And each of the conditions above is coverable by plain B-Tree indexes.

To make the conditions even more efficient, we will split each of the ranges into subchunks which can be tested with equality conditions (as opposed to range conditions). This will make HASH JOIN method usable (in case the optimizer will decide to use it).

Here's how we do it:

  • length = 0 AND
    • rge_startdate = @date
  • length = 1 AND
    • rge_startdate = @date
    • OR rge_startdate = @date - 1
  • length = 2 AND
    • rge_startdate = @date
    • OR rge_startdate = @date - 1
    • OR rge_startdate = @date - 2
  • …etc

We can easily constuct the numbers required for the JOIN above using recursive CTEs.

The conditions above, however, are necessary but not sufficient: all ranges containing given @date are guaranteed to satisfy these conditions but there are ranges satisfying the conditions above but not containing the @date.

To fine filter the ranges, we will just add original rge_from <= @date AND rge_till > @date to the WHERE clause of the query.

And here's the query:

DECLARE @date DATETIME
SET @date = '2009-01-01'

;WITH rd AS
        (
        SELECT  MAX(rge_length) AS length
        FROM    [20090811_ranges].t_range
        UNION ALL
        SELECT  length - 1
        FROM    rd
        WHERE   length > 0
        )
SELECT  SUM(LEN(rge_stuffing))
FROM    rd AS rd1
JOIN
        rd AS rd2
ON      rd2.length <= rd1.length
JOIN    [20090811_ranges].t_range r
ON      r.rge_length = rd1.length
        AND r.rge_startdate = CAST(CAST(CAST(@date AS FLOAT) AS INT) AS DATETIME) - rd2.length
        AND r.rge_from <= @date
        AND r.rge_till > @date

View details

55 milliseconds. Same result instantly.

It should be noted that but a single range having a length of 100 or more will make the CTE fail due to recursion limit.

This can be adjusted using MAXRECURSION option.

In this case, the splitting period should be reconsidered (to be a week or even more, depending on the maximum length).

Written by Quassnoi

August 11th, 2009 at 11:00 pm

Posted in SQL Server

Comments are closed.