EXPLAIN EXTENDED

How to create fast database queries

Finding sets within a time frame

Comments enabled. I *really* need your comment

From Stack Overflow:

I've got a set of data that contains users, their awards and the time they are granted.

When a user earns an award, the award and user is logged along with a date.

What I'd like is a query that finds a set of 3 achievements that were earned within 5 minutes of each other by the same user.

Any thoughts on how to accomplish this?

I'll go a step further and try to build a general solution to find sets of N values within a given time frame.

We should return all rows that are parts of all 5-minute time frames containing at least N values. These timeframes can overlap.

We should build such a timeframe for each record, make sure that there are at least N values in this timeframe, and if yes, return these values (applying DISTINCT to get rid of the duplicates).

Let's create a sample table and build a query to do this:


Table creation details

This is quite fast, 0.6 seconds

However, this query contains range index scans which can only be processed using NESTED LOOPS in SQL Server. And NESTED LOOPS is not the most efficient algorithm here since almost all values need to be processed.

If we could replace the range condition with the equality checks that would give the optimizer the opportunity to replace NESTED LOOPS with more efficient MERGE JOIN or HASH JOIN.

Fortunately, we can do this for this task.

We cannot do equality searches for the dates because they are not well-ordered: we cannot tell the next date given the current one.

However, we can enumerate the rows and do this with the row numbers.

Here's how we build the query:

  1. Get the user-wise row numbers using ROW_NUMBER(). This is fast since we have a composite index on (userid, granted): no sort.

  2. For each row, find out if its 5-minute timeframe contains at least 5 rows.

    To do this we just need to check that the row with same userid and the number greater by 4 is within the timeframe.

    This is an equijoin, not a range join: endvalue.rn = startvalue.rn + 4.

  3. If the 4th row starting from the current one is within the timeframe, return the next 4 rows.

    Normally, that would require a range condition, but we have ordinal row numbers and can do this using an equijoin just as well.

    To do this, we just need to generate numbers from 0 to 4 using a recursive CTE, add them to the current row number and equijoin on the resulting value. This will give us 5 records we were after.

  4. Group distinct id's and joins them back with the table. This is like in the previous quiery.

And here is the query itself:

WITH rows AS
(
SELECT 0 AS row
UNION ALL
SELECT row + 1
FROM rows
WHERE row < 5 - 1 ), ranges AS ( SELECT id, userid, granted, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY granted) AS rn FROM [20090922_frame].t_award ) SELECT am.* FROM ( SELECT DISTINCT rr.id FROM ranges rb CROSS JOIN rows JOIN ranges rr ON rr.userid = rb.userid AND rr.rn = rb.rn + row WHERE EXISTS ( SELECT NULL FROM ranges re WHERE re.userid = rb.userid AND re.rn = rb.rn + 4 AND re.granted <= DATEADD(minute, 5, rb.granted) ) ) rd JOIN [20090922_frame].t_award am ON am.id = rd.id [/sourcecode] View the results, statistics and execution plan

As we can see, the most expensive Nested Loops with Filter (that was used for the COUNT(*) subquery) is replaced with more efficient Hash Match (Right Semi Join) that processes the subquery much faster.

As a result, the query completes in only 0.38 seconds, or almost twice as fast as the original query.

Written by Quassnoi

September 22nd, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply