SQL Server: efficient DISTINCT on dates at EXPLAIN EXTENDED

I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.

My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar already in place using DISTINCT).

The problem is whether there is any trick to get it done more quickly.

With the data we are using, our current query is returning ~80 distinct days for which there are ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query always manages to take 5+ seconds. Which is too slow.

The problem with this query is that no distinct list of truncated dates is easily accessible.

That's why SQL Server has to fetch all the dates from the table (or from the index) and distinctify them somehow.

In Oracle we could probably create a function-based index that would index the truncated part of a DATETIME column. Then it would just use jump over the distinct index keys.

Same is possible in MySQL (however, the latter does not support function-based indexes, so one has to update the date-only column manually).

SQL Server does support indexes on computed columns, however, its optimizer lacks ability to skip distinct records in an index.

That's why creating an index on a computed column doesn't help.

But since we are talking about the dates, it's possible to emulate this behavior to some extent using a recursive CTE that would generate a list of dates, and the checking each date.

Let's create a sample table:

Table creation details

CREATE SCHEMA [20090820_distinct]
CREATE TABLE  t_datetime (
id INT NOT NULL PRIMARY KEY,
dtcol DATETIME NOT NULL,
dcol AS CONVERT(DATETIME, CONVERT(CHAR(8), dtcol, 112), 112) PERSISTED,
stuffing VARCHAR(100) NOT NULL
)
GO
CREATE INDEX IX_datetime_dtcol ON [20090820_distinct].t_datetime(dtcol)
CREATE INDEX IX_datetime_dcol ON [20090820_distinct].t_datetime(dcol)
GO
BEGIN TRANSACTION
DECLARE @pk INT
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 2000000
BEGIN
        INSERT
        INTO    [20090820_distinct].t_datetime (id, dtcol, stuffing)
        SELECT  @cnt, DATEADD(minute, -@cnt, '2009-20-08'), LEFT('Value ' + CAST(@cnt AS VARCHAR) + ' ' + REPLICATE('*', 100), 100)
        SET @cnt = @cnt + 1
END
COMMIT

This table has 2,000,000 records with 1,389 distinct dates.

There is an index on a DATETIME column as well as an index on a computed persisted DATE column.

Let's use plain DISTINCT clause on DATETIME column casted to a DATE:

SELECT  COUNT(date)
FROM    (
SELECT  DISTINCT CAST(CAST(CAST(dtcol AS FLOAT) AS INTEGER) AS DATETIME) AS date
FROM    [20090820_distinct].t_datetime
) q

See execution details


1389
1 row fetched in 0.0002s (8.1873s)

Table 't_datetime'. Scan count 3, logical reads 7915, physical reads 9, read-ahead reads 5776, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1422 ms,  elapsed time = 8192 ms.

SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Hash Match (Aggregate)
            Parallelism (Repartition Streams)
              Compute Scalar
                Index Scan [test].[20090820_distinct].[ix_datetime_dtcol]

This takes 8,192 seconds (no pun indended).

Now, same query on a computed column:

SELECT  COUNT(date)
FROM    (
SELECT  DISTINCT dcol AS date
FROM    [20090820_distinct].t_datetime
) q

See execution details


1389
1 row fetched in 0.0002s (9.1873s)

Table 't_datetime'. Scan count 3, logical reads 10833, physical reads 34, read-ahead reads 8393, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 624 ms,  elapsed time = 9183 ms.

SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Hash Match (Aggregate)
            Parallelism (Repartition Streams)
              Compute Scalar
                Index Scan [test].[20090820_distinct].[ix_datetime_dcol]

Same plan, same result, same time. A little saved on conversion.

Now, what can we do?

Since we are talking about dates, it may be reasonable to generate a list of possible dates and just check if every given date exists.

We can do it with a recursive CTE.

We will take MAX(dtcol) and MIN(dtcol) as recursion base, and add a date to the MIN(dtcol) until in reaches MAX(dtcol). This way we will get a list of consecutive dates from MIN(dtcol) to MAX(dtcol).

Then we can check if a single record exists within a given date using, of course, EXISTS. Since but an single index lookup will be performed for each date, this should be fast.

Let's try it:

WITH    rows AS
(
SELECT  CAST(CAST(CAST(MIN(dtcol) AS FLOAT) AS INTEGER) AS DATETIME) AS date,
MAX(dtcol) AS maxdate
FROM    [20090820_distinct].t_datetime
UNION ALL
SELECT  date + 1, maxdate
FROM    rows
WHERE   date &lt; maxdate
        )
SELECT  COUNT(date)
FROM    rows
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    [20090820_distinct].t_datetime
        WHERE   dtcol &gt;= date
AND dtcol &lt; date + 1
        )
OPTION  (MAXRECURSION 0)

See execution details


1389
1 row fetched in 0.0002s (0.0494s)

Table 't_datetime'. Scan count 1392, logical reads 5658, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 2, logical reads 8341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 49 ms.

SELECT
  Compute Scalar
    Stream Aggregate
      Nested Loops (Left Semi Join)
        Index Spool (Lazy Spool)
          Concatenation
            Compute Scalar
              Compute Scalar
                Stream Aggregate
                  Nested Loops (Inner Join)
                    Compute Scalar
                      Top
                        Index Scan [test].[20090820_distinct].[ix_datetime_dtcol]
                    Compute Scalar
                      Top
                        Index Scan [test].[20090820_distinct].[ix_datetime_dtcol]
            Nested Loops (Inner Join)
              Compute Scalar
                Index Spool (Lazy Spool)
              Compute Scalar
                Filter
                  Constant Scan

This yields the same result in 49 ms (almost instantly).

Written by Quassnoi

August 20th, 2009 at 11:00 pm

Posted in SQL Server

EXPLAIN EXTENDED

SQL Server: efficient DISTINCT on dates

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

SQL Server: efficient DISTINCT on dates

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow