EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Finding sets within a time frame

Comments enabled. I *really* need your comment

From Stack Overflow:

I’ve got a set of data that contains users, their awards and the time they are granted.

When a user earns an award, the award and user is logged along with a date.

What I’d like is a query that finds a set of 3 achievements that were earned within 5 minutes of each other by the same user.

Any thoughts on how to accomplish this?

I’ll go a step further and try to build a general solution to find sets of N values within a given time frame.

We should return all rows that are parts of all 5-minute time frames containing at least N values. These timeframes can overlap.

We should build such a timeframe for each record, make sure that there are at least N values in this timeframe, and if yes, return these values (applying DISTINCT to get rid of the duplicates).

Let’s create a sample table and build a query to do this:


Table creation details

This table contains 100,000 records for 10 users (10,000 records per user).

The awards are given with random intervals from 50 to 500 seconds.

The query below does the following:

  1. Builds a 5-minute timeframe for each record
  2. Checks that there are at least 5 records in the timeframe and if positive returns all id‘s from this timeframe.
  3. Groups distinct id‘s and joins them back with the table

View the query, results, statistics and execution plan

This is quite fast, 0.6 seconds

However, this query contains range index scans which can only be processed using NESTED LOOPS in SQL Server. And NESTED LOOPS is not the most efficient algorithm here since almost all values need to be processed.

If we could replace the range condition with the equality checks that would give the optimizer the opportunity to replace NESTED LOOPS with more efficient MERGE JOIN or HASH JOIN.

Fortunately, we can do this for this task.

We cannot do equality searches for the dates because they are not well-ordered: we cannot tell the next date given the current one.

However, we can enumerate the rows and do this with the row numbers.

Here’s how we build the query:

  1. Get the user-wise row numbers using ROW_NUMBER(). This is fast since we have a composite index on (userid, granted): no sort.

  2. For each row, find out if its 5-minute timeframe contains at least 5 rows.

    To do this we just need to check that the row with same userid and the number greater by 4 is within the timeframe.

    This is an equijoin, not a range join: endvalue.rn = startvalue.rn + 4.

  3. If the 4th row starting from the current one is within the timeframe, return the next 4 rows.

    Normally, that would require a range condition, but we have ordinal row numbers and can do this using an equijoin just as well.

    To do this, we just need to generate numbers from 0 to 4 using a recursive CTE, add them to the current row number and equijoin on the resulting value. This will give us 5 records we were after.

  4. Group distinct id‘s and joins them back with the table. This is like in the previous quiery.

And here is the query itself:

WITH    rows AS
        (
        SELECT  0 AS row
        UNION ALL
        SELECT  row + 1
        FROM    rows
        WHERE   row < 5 - 1
        ),
        ranges AS
        (
        SELECT  id, userid, granted, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY granted) AS rn
        FROM    [20090922_frame].t_award
        )
SELECT  am.*
FROM    (
        SELECT  DISTINCT rr.id
        FROM    ranges rb
        CROSS JOIN
                rows
        JOIN    ranges rr
        ON      rr.userid = rb.userid
                AND rr.rn = rb.rn + row
        WHERE   EXISTS
                (
                SELECT  NULL
                FROM    ranges re
                WHERE   re.userid = rb.userid
                        AND re.rn = rb.rn + 4
                        AND re.granted <= DATEADD(minute, 5, rb.granted)
                )
        ) rd
JOIN    [20090922_frame].t_award am
ON      am.id = rd.id

View the results, statistics and execution plan

As we can see, the most expensive Nested Loops with Filter (that was used for the COUNT(*) subquery) is replaced with more efficient Hash Match (Right Semi Join) that processes the subquery much faster.

As a result, the query completes in only 0.38 seconds, or almost twice as fast as the original query.

Written by Quassnoi

September 22nd, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply