Finding sets within a time frame at EXPLAIN EXTENDED

I've got a set of data that contains users, their awards and the time they are granted.

When a user earns an award, the award and user is logged along with a date.

What I'd like is a query that finds a set of 3 achievements that were earned within 5 minutes of each other by the same user.

Any thoughts on how to accomplish this?

I'll go a step further and try to build a general solution to find sets of N values within a given time frame.

We should return all rows that are parts of all 5-minute time frames containing at least N values. These timeframes can overlap.

We should build such a timeframe for each record, make sure that there are at least N values in this timeframe, and if yes, return these values (applying DISTINCT to get rid of the duplicates).

Let's create a sample table and build a query to do this:

Table creation details

CREATE SCHEMA [20090922_frame]
CREATE TABLE [20090922_frame].t_award (
        id INT NOT NULL PRIMARY KEY,
        userid INT NOT NULL,
        granted DATETIME NOT NULL,
        value VARCHAR(20) NOT NULL
)
GO
CREATE INDEX IX_award_userid_granted ON [20090922_frame].t_award (userid, granted)
BEGIN TRANSACTION
SELECT  RAND(0.20090922)
DECLARE @user INT
DECLARE @cnt INT
DECLARE @id INT
DECLARE @cdate DATETIME;
SET @user = 1
SET @id = 1
WHILE @user <= 10
BEGIN
        SET @cnt = 1
        SET @cdate = CONVERT(DATETIME, '20090922', 112)
        WHILE @cnt <= 10000
        BEGIN
                INSERT
                INTO    &#91;20090922_frame&#93;.t_award (id, userid, granted, value)
                VALUES  (@id, @user, @cdate, 'Award ' + CAST(@id AS VARCHAR))
                SET @cdate = DATEADD(second, RAND() * 450 + 50, @cdate)
                SET @cnt = @cnt + 1
                SET @id = @id + 1
        END
        SET @user = @user + 1
END
COMMIT
&#91;/sourcecode&#93;
</div>

This table contains <strong>100,000</strong> records for <strong>10</strong> users (<strong>10,000</strong> records per user).

The awards are given with random intervals from <strong>50</strong> to <strong>500</strong> seconds.

The query below does the following:

<ol>
	<li>Builds a 5-minute timeframe for each record</li>
	<li>Checks that there are at least <strong>5</strong> records in the timeframe and if positive returns all <code>id</code>'s from this timeframe.</li>
	<li>Groups distinct <code>id</code>'s and joins them back with the table</li>
</ol>


<a href="#" onclick="xcollapse('X9812');return false;"><strong>View the query, results, statistics and execution plan</strong></a>
<br />
<div id="X9812" style="display: none; ">

SELECT  am.*
FROM    (
        SELECT  DISTINCT ae.id
        FROM    [20090922_frame].t_award ab
        JOIN    [20090922_frame].t_award ae
        ON      ae.userid = ab.userid
                AND ae.granted BETWEEN ab.granted AND DATEADD(minute, 5, ab.granted)
        WHERE   (
                SELECT  COUNT(*)
                FROM    [20090922_frame].t_award ai
                WHERE   ai.userid = ab.userid
                        AND ai.granted BETWEEN ab.granted AND DATEADD(minute, 5, ab.granted)
                ) >= 5
        ) ad
JOIN    [20090922_frame].t_award am
ON      am.id = ad.id

id	userid	granted	value
2357	1	2009-09-29 11:42:22.000	Award 2357
9560	1	2009-10-22 06:57:12.000	Award 9560
9561	1	2009-10-22 06:58:27.000	Award 9561
16521	2	2009-10-12 17:36:33.000	Award 16521
39810	4	2009-10-23 01:11:55.000	Award 39810
39811	4	2009-10-23 01:12:54.000	Award 39811
39812	4	2009-10-23 01:14:25.000	Award 39812
40495	5	2009-09-23 13:44:56.000	Award 40495
40496	5	2009-09-23 13:46:06.000	Award 40496
40497	5	2009-09-23 13:47:17.000	Award 40497
47155	5	2009-10-14 15:24:54.000	Award 47155
52979	6	2009-10-01 12:42:45.000	Award 52979
53547	6	2009-10-03 08:05:23.000	Award 53547
75240	8	2009-10-08 16:52:40.000	Award 75240
75241	8	2009-10-08 16:53:52.000	Award 75241
75242	8	2009-10-08 16:55:28.000	Award 75242
75243	8	2009-10-08 16:56:20.000	Award 75243
2353	1	2009-09-29 11:37:22.000	Award 2353
2354	1	2009-09-29 11:38:22.000	Award 2354
2355	1	2009-09-29 11:40:02.000	Award 2355
2356	1	2009-09-29 11:41:06.000	Award 2356
9557	1	2009-10-22 06:54:13.000	Award 9557
9558	1	2009-10-22 06:55:07.000	Award 9558
9559	1	2009-10-22 06:56:09.000	Award 9559
16517	2	2009-10-12 17:31:37.000	Award 16517
16518	2	2009-10-12 17:32:48.000	Award 16518
16519	2	2009-10-12 17:33:59.000	Award 16519
16520	2	2009-10-12 17:35:07.000	Award 16520
39807	4	2009-10-23 01:08:26.000	Award 39807
39808	4	2009-10-23 01:09:43.000	Award 39808
39809	4	2009-10-23 01:10:48.000	Award 39809
40493	5	2009-09-23 13:42:25.000	Award 40493
40494	5	2009-09-23 13:43:18.000	Award 40494
47151	5	2009-10-14 15:19:56.000	Award 47151
47152	5	2009-10-14 15:21:03.000	Award 47152
47153	5	2009-10-14 15:22:16.000	Award 47153
47154	5	2009-10-14 15:23:59.000	Award 47154
52975	6	2009-10-01 12:37:53.000	Award 52975
52976	6	2009-10-01 12:39:18.000	Award 52976
52977	6	2009-10-01 12:40:22.000	Award 52977
52978	6	2009-10-01 12:41:44.000	Award 52978
53543	6	2009-10-03 08:01:01.000	Award 53543
53544	6	2009-10-03 08:02:08.000	Award 53544
53545	6	2009-10-03 08:03:42.000	Award 53545
53546	6	2009-10-03 08:04:32.000	Award 53546
75239	8	2009-10-08 16:51:42.000	Award 75239
46 rows fetched in 0.0020s (0.5937s)

Table 't_award'. Scan count 100009, logical reads 424139, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1124 ms,  elapsed time = 586 ms.

  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([ae].[id])=([am].[id]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ae].[id]))
            |    |--Hash Match(Left Semi Join, HASH:([ae].[userid])=([ab].[userid]), RESIDUAL:([test].[20090922_frame].[t_award].[userid] as [ae].[userid]=[test].[20090922_frame].[t_award].[userid] as [ab].[userid] AND [test].[20090922_frame].[t_award].[granted] as [ae].[granted]>=[test].[20090922_frame].[t_award].[granted] as [ab].[granted] AND [test].[20090922_frame].[t_award].[granted] as [ae].[granted]<=[Expr1010]))
            |         |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ae].[userid]))
            |         |    |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted] AS [ae]))
            |         |--Filter(WHERE:([Expr1006]>=(5)))
            |              |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ab].[userid]))
            |                   |--Nested Loops(Inner Join, OUTER REFERENCES:([ab].[userid], [ab].[granted], [Expr1014]) WITH UNORDERED PREFETCH)
            |                        |--Compute Scalar(DEFINE:([Expr1010]=dateadd(minute,(5),[test].[20090922_frame].[t_award].[granted] as [ab].[granted])))
            |                        |    |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted] AS [ab]))
            |                        |--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[Expr1013],0)))
            |                             |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
            |                                  |--Index Seek(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted] AS [ai]), SEEK:([ai].[userid]=[test].[20090922_frame].[t_award].[userid] as [ab].[userid] AND [ai].[granted] >= [test].[20090922_frame].[t_award].[granted] as [ab].[granted] AND [ai].[granted] <= dateadd(minute,(5),[test].[20090922_frame].[t_award].[granted] as [ab].[granted])) ORDERED FORWARD)
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([am].[id]))
                 |--Clustered Index Scan(OBJECT:([test].[20090922_frame].[t_award].[PK__t_award__3C94E422] AS [am]))

This is quite fast, 0.6 seconds

However, this query contains range index scans which can only be processed using NESTED LOOPS in SQL Server. And NESTED LOOPS is not the most efficient algorithm here since almost all values need to be processed.

If we could replace the range condition with the equality checks that would give the optimizer the opportunity to replace NESTED LOOPS with more efficient MERGE JOIN or HASH JOIN.

Fortunately, we can do this for this task.

We cannot do equality searches for the dates because they are not well-ordered: we cannot tell the next date given the current one.

However, we can enumerate the rows and do this with the row numbers.

Here's how we build the query:

Get the user-wise row numbers using ROW_NUMBER(). This is fast since we have a composite index on (userid, granted): no sort.
For each row, find out if its 5-minute timeframe contains at least 5 rows.

To do this we just need to check that the row with same userid and the number greater by 4 is within the timeframe.

This is an equijoin, not a range join: endvalue.rn = startvalue.rn + 4.
If the 4th row starting from the current one is within the timeframe, return the next 4 rows.

Normally, that would require a range condition, but we have ordinal row numbers and can do this using an equijoin just as well.

To do this, we just need to generate numbers from 0 to 4 using a recursive CTE, add them to the current row number and equijoin on the resulting value. This will give us 5 records we were after.
Group distinct id's and joins them back with the table. This is like in the previous quiery.

And here is the query itself:

WITH rows AS
(
SELECT 0 AS row
UNION ALL
SELECT row + 1
FROM rows
WHERE row < 5 - 1 ), ranges AS ( SELECT id, userid, granted, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY granted) AS rn FROM [20090922_frame].t_award ) SELECT am.* FROM ( SELECT DISTINCT rr.id FROM ranges rb CROSS JOIN rows JOIN ranges rr ON rr.userid = rb.userid AND rr.rn = rb.rn + row WHERE EXISTS ( SELECT NULL FROM ranges re WHERE re.userid = rb.userid AND re.rn = rb.rn + 4 AND re.granted <= DATEADD(minute, 5, rb.granted) ) ) rd JOIN [20090922_frame].t_award am ON am.id = rd.id [/sourcecode] View the results, statistics and execution plan

id	userid	granted	value
2353	1	2009-09-29 11:37:22.000	Award 2353
2354	1	2009-09-29 11:38:22.000	Award 2354
2355	1	2009-09-29 11:40:02.000	Award 2355
2356	1	2009-09-29 11:41:06.000	Award 2356
9557	1	2009-10-22 06:54:13.000	Award 9557
9558	1	2009-10-22 06:55:07.000	Award 9558
9559	1	2009-10-22 06:56:09.000	Award 9559
16517	2	2009-10-12 17:31:37.000	Award 16517
16518	2	2009-10-12 17:32:48.000	Award 16518
16519	2	2009-10-12 17:33:59.000	Award 16519
16520	2	2009-10-12 17:35:07.000	Award 16520
39807	4	2009-10-23 01:08:26.000	Award 39807
39808	4	2009-10-23 01:09:43.000	Award 39808
39809	4	2009-10-23 01:10:48.000	Award 39809
40493	5	2009-09-23 13:42:25.000	Award 40493
40494	5	2009-09-23 13:43:18.000	Award 40494
47151	5	2009-10-14 15:19:56.000	Award 47151
47152	5	2009-10-14 15:21:03.000	Award 47152
47153	5	2009-10-14 15:22:16.000	Award 47153
47154	5	2009-10-14 15:23:59.000	Award 47154
52975	6	2009-10-01 12:37:53.000	Award 52975
52976	6	2009-10-01 12:39:18.000	Award 52976
52977	6	2009-10-01 12:40:22.000	Award 52977
52978	6	2009-10-01 12:41:44.000	Award 52978
53543	6	2009-10-03 08:01:01.000	Award 53543
53544	6	2009-10-03 08:02:08.000	Award 53544
53545	6	2009-10-03 08:03:42.000	Award 53545
53546	6	2009-10-03 08:04:32.000	Award 53546
75239	8	2009-10-08 16:51:42.000	Award 75239
2357	1	2009-09-29 11:42:22.000	Award 2357
9560	1	2009-10-22 06:57:12.000	Award 9560
9561	1	2009-10-22 06:58:27.000	Award 9561
16521	2	2009-10-12 17:36:33.000	Award 16521
39810	4	2009-10-23 01:11:55.000	Award 39810
39811	4	2009-10-23 01:12:54.000	Award 39811
39812	4	2009-10-23 01:14:25.000	Award 39812
40495	5	2009-09-23 13:44:56.000	Award 40495
40496	5	2009-09-23 13:46:06.000	Award 40496
40497	5	2009-09-23 13:47:17.000	Award 40497
47155	5	2009-10-14 15:24:54.000	Award 47155
52979	6	2009-10-01 12:42:45.000	Award 52979
53547	6	2009-10-03 08:05:23.000	Award 53547
75240	8	2009-10-08 16:52:40.000	Award 75240
75241	8	2009-10-08 16:53:52.000	Award 75241
75242	8	2009-10-08 16:55:28.000	Award 75242
75243	8	2009-10-08 16:56:20.000	Award 75243
46 rows fetched in 0.0025s (0.3807s)

Table 'Worktable'. Scan count 4, logical reads 79, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_award'. Scan count 8, logical reads 2639, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 702 ms,  elapsed time = 380 ms.

  |--Parallelism(Gather Streams)
       |--Merge Join(Inner Join, MERGE:([test].[20090922_frame].[t_award].[id])=([am].[id]), RESIDUAL:([test].[20090922_frame].[t_award].[id] as [am].[id]=[test].[20090922_frame].[t_award].[id]))
            |--Sort(ORDER BY:([test].[20090922_frame].[t_award].[id] ASC))
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090922_frame].[t_award].[id]))
            |         |--Hash Match(Right Semi Join, HASH:([test].[20090922_frame].[t_award].[userid], [Expr1020])=([test].[20090922_frame].[t_award].[userid], [Expr1011]), RESIDUAL:([test].[20090922_frame].[t_award].[userid]=[test].[20090922_frame].[t_award].[userid] AND [Expr1011]=[Expr1020]))
            |              |--Bitmap(HASH:([test].[20090922_frame].[t_award].[userid], [Expr1020]), DEFINE:([Bitmap1032]))
            |              |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090922_frame].[t_award].[userid], [Expr1020]))
            |              |         |--Compute Scalar(DEFINE:([Expr1020]=[Expr1003]+CONVERT_IMPLICIT(bigint,[Recr1007],0)))
            |              |              |--Nested Loops(Inner Join)
            |              |                   |--Parallelism(Distribute Streams, RoundRobin Partitioning)
            |              |                   |    |--Index Spool(WITH STACK)
            |              |                   |         |--Concatenation
            |              |                   |              |--Compute Scalar(DEFINE:([Expr1022]=(0)))
            |              |                   |              |    |--Constant Scan(VALUES:(((0))))
            |              |                   |              |--Assert(WHERE:(CASE WHEN [Expr1024]>(100) THEN (0) ELSE NULL END))
            |              |                   |                   |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1024], [Recr1005]))
            |              |                   |                        |--Compute Scalar(DEFINE:([Expr1024]=[Expr1023]+(1)))
            |              |                   |                        |    |--Table Spool(WITH STACK)
            |              |                   |                        |--Compute Scalar(DEFINE:([Expr1006]=[Recr1005]+(1)))
            |              |                   |                             |--Filter(WHERE:(STARTUP EXPR([Recr1005]<(4))))
            |              |                   |                                  |--Constant Scan
            |              |                   |--Table Spool
            |              |                        |--Merge Join(Right Semi Join, MANY-TO-MANY MERGE:([test].[20090922_frame].[t_award].[userid], [Expr1015])=([test].[20090922_frame].[t_award].[userid], [Expr1021]), RESIDUAL:([test].[20090922_frame].[t_award].[userid]=[test].[20090922_frame].[t_award].[userid] AND [Expr1015]=([Expr1003]+(4)) AND [test].[20090922_frame].[t_award].[granted]<=[Expr1019]))
            |              |                             |--Sequence Project(DEFINE:([Expr1015]=row_number))
            |              |                             |    |--Compute Scalar(DEFINE:([Expr1027]=(1)))
            |              |                             |         |--Segment
            |              |                             |              |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted]), ORDERED FORWARD)
            |              |                             |--Compute Scalar(DEFINE:([Expr1021]=[Expr1003]+(4)))
            |              |                                  |--Compute Scalar(DEFINE:([Expr1019]=dateadd(minute,(5),[test].[20090922_frame].[t_award].[granted])))
            |              |                                       |--Sequence Project(DEFINE:([Expr1003]=row_number))
            |              |                                            |--Compute Scalar(DEFINE:([Expr1029]=(1)))
            |              |                                                 |--Segment
            |              |                                                      |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted]), ORDERED FORWARD)
            |              |--Parallelism(Distribute Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090922_frame].[t_award].[userid], [Expr1011]), WHERE:(PROBE([Bitmap1032])=TRUE))
            |                   |--Sequence Project(DEFINE:([Expr1011]=row_number))
            |                        |--Compute Scalar(DEFINE:([Expr1031]=(1)))
            |                             |--Segment
            |                                  |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted]), ORDERED FORWARD)
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([am].[id]), ORDER BY:([am].[id] ASC))
                 |--Clustered Index Scan(OBJECT:([test].[20090922_frame].[t_award].[PK__t_award__3C94E422] AS [am]), ORDERED FORWARD)

As we can see, the most expensive Nested Loops with Filter (that was used for the COUNT(*) subquery) is replaced with more efficient Hash Match (Right Semi Join) that processes the subquery much faster.

As a result, the query completes in only 0.38 seconds, or almost twice as fast as the original query.

Written by Quassnoi

September 22nd, 2009 at 11:00 pm

Posted in SQL Server

EXPLAIN EXTENDED

Finding sets within a time frame

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow