EXPLAIN EXTENDED

How to create fast database queries

SQL Server: efficient DISTINCT on dates

Comments enabled. I *really* need your comment

From Stack Overflow:

I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.

My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar already in place using DISTINCT).

The problem is whether there is any trick to get it done more quickly.

With the data we are using, our current query is returning ~80 distinct days for which there are ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query always manages to take 5+ seconds. Which is too slow.

The problem with this query is that no distinct list of truncated dates is easily accessible.

That's why SQL Server has to fetch all the dates from the table (or from the index) and distinctify them somehow.

In Oracle we could probably create a function-based index that would index the truncated part of a DATETIME column. Then it would just use jump over the distinct index keys.

Same is possible in MySQL (however, the latter does not support function-based indexes, so one has to update the date-only column manually).

SQL Server does support indexes on computed columns, however, its optimizer lacks ability to skip distinct records in an index.

That's why creating an index on a computed column doesn't help.

But since we are talking about the dates, it's possible to emulate this behavior to some extent using a recursive CTE that would generate a list of dates, and the checking each date.

Let's create a sample table:


Table creation details

This table has 2,000,000 records with 1,389 distinct dates.

There is an index on a DATETIME column as well as an index on a computed persisted DATE column.

Let's use plain DISTINCT clause on DATETIME column casted to a DATE:

SELECT  COUNT(date)
FROM    (
SELECT  DISTINCT CAST(CAST(CAST(dtcol AS FLOAT) AS INTEGER) AS DATETIME) AS date
FROM    [20090820_distinct].t_datetime
) q

See execution details

This takes 8,192 seconds (no pun indended).

Now, same query on a computed column:

SELECT  COUNT(date)
FROM    (
SELECT  DISTINCT dcol AS date
FROM    [20090820_distinct].t_datetime
) q

See execution details

Same plan, same result, same time. A little saved on conversion.

Now, what can we do?

Since we are talking about dates, it may be reasonable to generate a list of possible dates and just check if every given date exists.

We can do it with a recursive CTE.

We will take MAX(dtcol) and MIN(dtcol) as recursion base, and add a date to the MIN(dtcol) until in reaches MAX(dtcol). This way we will get a list of consecutive dates from MIN(dtcol) to MAX(dtcol).

Then we can check if a single record exists within a given date using, of course, EXISTS. Since but an single index lookup will be performed for each date, this should be fast.

Let's try it:

WITH    rows AS
(
SELECT  CAST(CAST(CAST(MIN(dtcol) AS FLOAT) AS INTEGER) AS DATETIME) AS date,
MAX(dtcol) AS maxdate
FROM    [20090820_distinct].t_datetime
UNION ALL
SELECT  date + 1, maxdate
FROM    rows
WHERE   date < maxdate
        )
SELECT  COUNT(date)
FROM    rows
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    [20090820_distinct].t_datetime
        WHERE   dtcol >= date
AND dtcol < date + 1
        )
OPTION  (MAXRECURSION 0)

See execution details

This yields the same result in 49 ms (almost instantly).

Written by Quassnoi

August 20th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply