Selecting birthdays at EXPLAIN EXTENDED

Selecting birthdays

Comments enabled. I *really* need your comment

Answering questions asked on the site.

James asks:

I'm developing a forum and want to select all users that have a birthday within the next 3 days.

How do I do it?

This is in SQL Server

This is a very nice feature which every decent forum should have, and I'll be glad to answer this question.

Unfortunately you didn't provide the names of your tables, so I'll have to make them up.

It's not much of a stretch to assume that your table is called t_user and you keep the user's birthdates in a DATETIME field called birthdate.

A birthday within next 3 days means that if you add the person's age to the person's birthdate, you get a date between the current date and three days after it.

To check this, we just need to calculate the number of months between the dates and make sure that it's divisible by 12 with reminder of 0 or 11 (to handle month transitions).

Then we need to add a transition month and divide the number of months by 12. The quotient will give us the number of years we need to add to the birthdate to compare the result with GETDATE().

Let's create a sample table and see how to do it:

CREATE SCHEMA [20090709_birthday]
CREATE TABLE t_user (
        id INT NOT NULL PRIMARY KEY,
        birthdate DATETIME NOT NULL,
        stuffing VARCHAR(100) NOT NULL
)
GO
BEGIN TRANSACTION
SELECT  RAND(20090709)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        INSERT
        INTO    &#91;20090709_birthday&#93;.t_user
        VALUES  (
                @cnt,
                DATEADD(day, 365 * (LOG(0.001 + RAND() * 0.999) * 13 - 15), '2009-09-07'),
                LEFT('User ' + CAST(@cnt AS VARCHAR) + REPLICATE(' ', 300), 300)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
CREATE INDEX ix_user_birthdate ON &#91;20090709_birthday&#93;.t_user (birthdate)
GO
&#91;/sourcecode&#93;

There are <strong>1,000,000</strong> users with logarithmic distribution of ages from <strong>15</strong> to <strong>106</strong> years. This is of course not exactly what the age distribution of a typical forum looks like, but quite close to it to demonstrate my point.

Here's the query:


DECLARE @today DATETIME
SET @today = '2009-07-09';
SELECT  MONTH(birthdate), DAY(birthdate), COUNT(*)
FROM    [20090709_birthday].t_user
WHERE   DATEDIFF(month, birthdate, @today) % 12 IN (11, 0)
        AND DATEADD(year, (DATEDIFF(month, birthdate, @today) + 1) / 12, birthdate) BETWEEN @today AND DATEADD(day, 3, @today)
GROUP BY
        MONTH(birthdate), DAY(birthdate)
ORDER BY
        MONTH(birthdate), DAY(birthdate)


7	9	2872
7	10	2867
7	11	2782
7	12	2853
4 rows fetched in 0.0003s (0.2526s)

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_user'. Scan count 3, logical reads 2045, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 500 ms,  elapsed time = 252 ms.

This uses INDEX SCAN and runs for 252 ms, which is quite a load on a crowded forum.

Can we optimize this query somehow?

Yes, we can.

The condition we use here is not sargable, that is it cannot use an index to do range seeks, since the indexed value is wrapped into the function.

What if we don't add the birthdate to get the current date, but do the opposite: subtract a number of years from the current date and compare it with the birthdate?

We cannot do it using only a WHERE condition: we need to know the birthdate to know how many years to subtract, and we need to know the number of years to filter the birthdate on.

To work around it, we can generate a dummy resultset using a recursive CTE. This will give us a set of possible years to subtract from the current date, and we can seek a number of ranges instead of scanning the whole index.

Since 3 days of a year gives us less than 1% of values, this should be much more efficient.

However, we don't know exactly how many years to generate. It's widely believed that people can live for 120 years at most, but who knows?

But in fact, we don't need to cover all ranges. People over 80 don't hang on forums too much.

So we can make this query in two parts: use range seeks for the most popular ages, say, up to 80 years old, and then just perform single range scan for all people older than that, using the condition from the query above.

The latter condition will not return as many rows anyway, so checking a dozen of old men won't hurt much.

Let's create the query:

DECLARE @today DATETIME
SET @today = '2009-09-07';
WITH cal AS (
SELECT 1 AS c_year
UNION ALL
SELECT c_year + 1
FROM cal
WHERE c_year < 80 ) SELECT MONTH(birthdate), DAY(birthdate), COUNT(*) FROM ( SELECT birthdate FROM cal JOIN [20090709_birthday].t_user ON birthdate BETWEEN DATEADD(year, -c_year, @today) AND DATEADD(year, -c_year, @today) + 3 UNION ALL SELECT birthdate FROM [20090709_birthday].t_user WHERE birthdate < DATEADD(year, -80, @today) AND DATEDIFF(month, birthdate, @today) % 12 IN (11, 0) AND DATEADD(year, (DATEDIFF(month, birthdate, @today) + 1) / 12, birthdate) BETWEEN @today AND DATEADD(day, 3, @today) ) q GROUP BY MONTH(birthdate), DAY(birthdate) ORDER BY MONTH(birthdate), DAY(birthdate) [/sourcecode]


9	7	2646
9	8	2552
9	9	2655
9	10	2708
4 rows fetched in 0.0003s (0.0153s)

Table 'Worktable'. Scan count 2, logical reads 481, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_user'. Scan count 81, logical reads 354, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 15 ms.

This query consists of two parts merged with a UNION ALL.

The first query joins the generated rowset of all possible years from 1 to 80 with the table and uses the NESTED LOOPS with an INDEX SEEK on each step.

The second query selects every person older than 80 years and just checks them with the same condition we used above. This query returns very few rows so this check completes in no time.

Overall query time is 15 ms, which is more than 10 times faster than the original query which uses pure comparison.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

July 9th, 2009 at 11:00 pm

Posted in SQL Server

EXPLAIN EXTENDED