SQL Server: efficient DISTINCT on dates
Comments enabled. I *really* need your comment
From Stack Overflow:
I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.
My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar already in place using
DISTINCT
).The problem is whether there is any trick to get it done more quickly.
With the data we are using, our current query is returning ~80 distinct days for which there are ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query always manages to take 5+ seconds. Which is too slow.
The problem with this query is that no distinct list of truncated dates is easily accessible.
That's why SQL Server has to fetch all the dates from the table (or from the index) and distinctify them somehow.
In Oracle we could probably create a function-based index that would index the truncated part of a DATETIME
column. Then it would just use jump over the distinct index keys.
Same is possible in MySQL (however, the latter does not support function-based indexes, so one has to update the date-only column manually).
SQL Server does support indexes on computed columns, however, its optimizer lacks ability to skip distinct records in an index.
That's why creating an index on a computed column doesn't help.
But since we are talking about the dates, it's possible to emulate this behavior to some extent using a recursive CTE that would generate a list of dates, and the checking each date.
Let's create a sample table:
This table has 2,000,000 records with 1,389 distinct dates.
There is an index on a DATETIME
column as well as an index on a computed persisted DATE column.
Let's use plain DISTINCT
clause on DATETIME
column casted to a DATE
:
SELECT COUNT(date) FROM ( SELECT DISTINCT CAST(CAST(CAST(dtcol AS FLOAT) AS INTEGER) AS DATETIME) AS date FROM [20090820_distinct].t_datetime ) q
This takes 8,192 seconds (no pun indended).
Now, same query on a computed column:
SELECT COUNT(date) FROM ( SELECT DISTINCT dcol AS date FROM [20090820_distinct].t_datetime ) q
Same plan, same result, same time. A little saved on conversion.
Now, what can we do?
Since we are talking about dates, it may be reasonable to generate a list of possible dates and just check if every given date exists.
We can do it with a recursive CTE.
We will take MAX(dtcol)
and MIN(dtcol)
as recursion base, and add a date to the MIN(dtcol)
until in reaches MAX(dtcol)
. This way we will get a list of consecutive dates from MIN(dtcol)
to MAX(dtcol)
.
Then we can check if a single record exists within a given date using, of course, EXISTS
. Since but an single index lookup will be performed for each date, this should be fast.
Let's try it:
WITH rows AS ( SELECT CAST(CAST(CAST(MIN(dtcol) AS FLOAT) AS INTEGER) AS DATETIME) AS date, MAX(dtcol) AS maxdate FROM [20090820_distinct].t_datetime UNION ALL SELECT date + 1, maxdate FROM rows WHERE date < maxdate ) SELECT COUNT(date) FROM rows WHERE EXISTS ( SELECT NULL FROM [20090820_distinct].t_datetime WHERE dtcol >= date AND dtcol < date + 1 ) OPTION (MAXRECURSION 0)
This yields the same result in 49 ms (almost instantly).