From Stack Overflow:
I've got a set of data that contains users, their awards and the time they are granted.
When a user earns an award, the award and user is logged along with a date.
What I'd like is a query that finds a set of 3 achievements that were earned within 5 minutes of each other by the same user.
Any thoughts on how to accomplish this?
I'll go a step further and try to build a general solution to find sets of N
values within a given time frame.
We should return all rows that are parts of all 5-minute time frames containing at least N
values. These timeframes can overlap.
We should build such a timeframe for each record, make sure that there are at least N
values in this timeframe, and if yes, return these values (applying DISTINCT
to get rid of the duplicates).
Let's create a sample table and build a query to do this:
Table creation details
CREATE SCHEMA [20090922_frame]
CREATE TABLE [20090922_frame].t_award (
id INT NOT NULL PRIMARY KEY,
userid INT NOT NULL,
granted DATETIME NOT NULL,
value VARCHAR(20) NOT NULL
)
GO
CREATE INDEX IX_award_userid_granted ON [20090922_frame].t_award (userid, granted)
BEGIN TRANSACTION
SELECT RAND(0.20090922)
DECLARE @user INT
DECLARE @cnt INT
DECLARE @id INT
DECLARE @cdate DATETIME;
SET @user = 1
SET @id = 1
WHILE @user <= 10
BEGIN
SET @cnt = 1
SET @cdate = CONVERT(DATETIME, '20090922', 112)
WHILE @cnt <= 10000
BEGIN
INSERT
INTO [20090922_frame].t_award (id, userid, granted, value)
VALUES (@id, @user, @cdate, 'Award ' + CAST(@id AS VARCHAR))
SET @cdate = DATEADD(second, RAND() * 450 + 50, @cdate)
SET @cnt = @cnt + 1
SET @id = @id + 1
END
SET @user = @user + 1
END
COMMIT
[/sourcecode]
</div>
This table contains <strong>100,000</strong> records for <strong>10</strong> users (<strong>10,000</strong> records per user).
The awards are given with random intervals from <strong>50</strong> to <strong>500</strong> seconds.
The query below does the following:
<ol>
<li>Builds a 5-minute timeframe for each record</li>
<li>Checks that there are at least <strong>5</strong> records in the timeframe and if positive returns all <code>id</code>'s from this timeframe.</li>
<li>Groups distinct <code>id</code>'s and joins them back with the table</li>
</ol>
<a href="#" onclick="xcollapse('X9812');return false;"><strong>View the query, results, statistics and execution plan</strong></a>
<br />
<div id="X9812" style="display: none; ">
SELECT am.*
FROM (
SELECT DISTINCT ae.id
FROM [20090922_frame].t_award ab
JOIN [20090922_frame].t_award ae
ON ae.userid = ab.userid
AND ae.granted BETWEEN ab.granted AND DATEADD(minute, 5, ab.granted)
WHERE (
SELECT COUNT(*)
FROM [20090922_frame].t_award ai
WHERE ai.userid = ab.userid
AND ai.granted BETWEEN ab.granted AND DATEADD(minute, 5, ab.granted)
) >= 5
) ad
JOIN [20090922_frame].t_award am
ON am.id = ad.id
id |
userid |
granted |
value |
2357 |
1 |
2009-09-29 11:42:22.000 |
Award 2357 |
9560 |
1 |
2009-10-22 06:57:12.000 |
Award 9560 |
9561 |
1 |
2009-10-22 06:58:27.000 |
Award 9561 |
16521 |
2 |
2009-10-12 17:36:33.000 |
Award 16521 |
39810 |
4 |
2009-10-23 01:11:55.000 |
Award 39810 |
39811 |
4 |
2009-10-23 01:12:54.000 |
Award 39811 |
39812 |
4 |
2009-10-23 01:14:25.000 |
Award 39812 |
40495 |
5 |
2009-09-23 13:44:56.000 |
Award 40495 |
40496 |
5 |
2009-09-23 13:46:06.000 |
Award 40496 |
40497 |
5 |
2009-09-23 13:47:17.000 |
Award 40497 |
47155 |
5 |
2009-10-14 15:24:54.000 |
Award 47155 |
52979 |
6 |
2009-10-01 12:42:45.000 |
Award 52979 |
53547 |
6 |
2009-10-03 08:05:23.000 |
Award 53547 |
75240 |
8 |
2009-10-08 16:52:40.000 |
Award 75240 |
75241 |
8 |
2009-10-08 16:53:52.000 |
Award 75241 |
75242 |
8 |
2009-10-08 16:55:28.000 |
Award 75242 |
75243 |
8 |
2009-10-08 16:56:20.000 |
Award 75243 |
2353 |
1 |
2009-09-29 11:37:22.000 |
Award 2353 |
2354 |
1 |
2009-09-29 11:38:22.000 |
Award 2354 |
2355 |
1 |
2009-09-29 11:40:02.000 |
Award 2355 |
2356 |
1 |
2009-09-29 11:41:06.000 |
Award 2356 |
9557 |
1 |
2009-10-22 06:54:13.000 |
Award 9557 |
9558 |
1 |
2009-10-22 06:55:07.000 |
Award 9558 |
9559 |
1 |
2009-10-22 06:56:09.000 |
Award 9559 |
16517 |
2 |
2009-10-12 17:31:37.000 |
Award 16517 |
16518 |
2 |
2009-10-12 17:32:48.000 |
Award 16518 |
16519 |
2 |
2009-10-12 17:33:59.000 |
Award 16519 |
16520 |
2 |
2009-10-12 17:35:07.000 |
Award 16520 |
39807 |
4 |
2009-10-23 01:08:26.000 |
Award 39807 |
39808 |
4 |
2009-10-23 01:09:43.000 |
Award 39808 |
39809 |
4 |
2009-10-23 01:10:48.000 |
Award 39809 |
40493 |
5 |
2009-09-23 13:42:25.000 |
Award 40493 |
40494 |
5 |
2009-09-23 13:43:18.000 |
Award 40494 |
47151 |
5 |
2009-10-14 15:19:56.000 |
Award 47151 |
47152 |
5 |
2009-10-14 15:21:03.000 |
Award 47152 |
47153 |
5 |
2009-10-14 15:22:16.000 |
Award 47153 |
47154 |
5 |
2009-10-14 15:23:59.000 |
Award 47154 |
52975 |
6 |
2009-10-01 12:37:53.000 |
Award 52975 |
52976 |
6 |
2009-10-01 12:39:18.000 |
Award 52976 |
52977 |
6 |
2009-10-01 12:40:22.000 |
Award 52977 |
52978 |
6 |
2009-10-01 12:41:44.000 |
Award 52978 |
53543 |
6 |
2009-10-03 08:01:01.000 |
Award 53543 |
53544 |
6 |
2009-10-03 08:02:08.000 |
Award 53544 |
53545 |
6 |
2009-10-03 08:03:42.000 |
Award 53545 |
53546 |
6 |
2009-10-03 08:04:32.000 |
Award 53546 |
75239 |
8 |
2009-10-08 16:51:42.000 |
Award 75239 |
46 rows fetched in 0.0020s (0.5937s) |
Table 't_award'. Scan count 100009, logical reads 424139, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1124 ms, elapsed time = 586 ms.
|--Parallelism(Gather Streams)
|--Hash Match(Inner Join, HASH:([ae].[id])=([am].[id]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ae].[id]))
| |--Hash Match(Left Semi Join, HASH:([ae].[userid])=([ab].[userid]), RESIDUAL:([test].[20090922_frame].[t_award].[userid] as [ae].[userid]=[test].[20090922_frame].[t_award].[userid] as [ab].[userid] AND [test].[20090922_frame].[t_award].[granted] as [ae].[granted]>=[test].[20090922_frame].[t_award].[granted] as [ab].[granted] AND [test].[20090922_frame].[t_award].[granted] as [ae].[granted]<=[Expr1010]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ae].[userid]))
| | |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted] AS [ae]))
| |--Filter(WHERE:([Expr1006]>=(5)))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ab].[userid]))
| |--Nested Loops(Inner Join, OUTER REFERENCES:([ab].[userid], [ab].[granted], [Expr1014]) WITH UNORDERED PREFETCH)
| |--Compute Scalar(DEFINE:([Expr1010]=dateadd(minute,(5),[test].[20090922_frame].[t_award].[granted] as [ab].[granted])))
| | |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted] AS [ab]))
| |--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[Expr1013],0)))
| |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
| |--Index Seek(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted] AS [ai]), SEEK:([ai].[userid]=[test].[20090922_frame].[t_award].[userid] as [ab].[userid] AND [ai].[granted] >= [test].[20090922_frame].[t_award].[granted] as [ab].[granted] AND [ai].[granted] <= dateadd(minute,(5),[test].[20090922_frame].[t_award].[granted] as [ab].[granted])) ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([am].[id]))
|--Clustered Index Scan(OBJECT:([test].[20090922_frame].[t_award].[PK__t_award__3C94E422] AS [am]))
This is quite fast, 0.6 seconds
However, this query contains range index scans which can only be processed using NESTED LOOPS
in SQL Server. And NESTED LOOPS
is not the most efficient algorithm here since almost all values need to be processed.
If we could replace the range condition with the equality checks that would give the optimizer the opportunity to replace NESTED LOOPS
with more efficient MERGE JOIN
or HASH JOIN
.
Fortunately, we can do this for this task.
We cannot do equality searches for the dates because they are not well-ordered: we cannot tell the next date given the current one.
However, we can enumerate the rows and do this with the row numbers.
Here's how we build the query:
-
Get the user-wise row numbers using ROW_NUMBER()
. This is fast since we have a composite index on (userid, granted)
: no sort.
-
For each row, find out if its 5-minute timeframe contains at least 5 rows.
To do this we just need to check that the row with same userid
and the number greater by 4 is within the timeframe.
This is an equijoin, not a range join: endvalue.rn = startvalue.rn + 4
.
-
If the 4th row starting from the current one is within the timeframe, return the next 4 rows.
Normally, that would require a range condition, but we have ordinal row numbers and can do this using an equijoin just as well.
To do this, we just need to generate numbers from 0 to 4 using a recursive CTE, add them to the current row number and equijoin on the resulting value. This will give us 5 records we were after.
-
Group distinct id
's and joins them back with the table. This is like in the previous quiery.
And here is the query itself:
WITH rows AS
(
SELECT 0 AS row
UNION ALL
SELECT row + 1
FROM rows
WHERE row < 5 - 1
),
ranges AS
(
SELECT id, userid, granted, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY granted) AS rn
FROM [20090922_frame].t_award
)
SELECT am.*
FROM (
SELECT DISTINCT rr.id
FROM ranges rb
CROSS JOIN
rows
JOIN ranges rr
ON rr.userid = rb.userid
AND rr.rn = rb.rn + row
WHERE EXISTS
(
SELECT NULL
FROM ranges re
WHERE re.userid = rb.userid
AND re.rn = rb.rn + 4
AND re.granted <= DATEADD(minute, 5, rb.granted)
)
) rd
JOIN [20090922_frame].t_award am
ON am.id = rd.id
[/sourcecode]
View the results, statistics and execution plan
id |
userid |
granted |
value |
2353 |
1 |
2009-09-29 11:37:22.000 |
Award 2353 |
2354 |
1 |
2009-09-29 11:38:22.000 |
Award 2354 |
2355 |
1 |
2009-09-29 11:40:02.000 |
Award 2355 |
2356 |
1 |
2009-09-29 11:41:06.000 |
Award 2356 |
9557 |
1 |
2009-10-22 06:54:13.000 |
Award 9557 |
9558 |
1 |
2009-10-22 06:55:07.000 |
Award 9558 |
9559 |
1 |
2009-10-22 06:56:09.000 |
Award 9559 |
16517 |
2 |
2009-10-12 17:31:37.000 |
Award 16517 |
16518 |
2 |
2009-10-12 17:32:48.000 |
Award 16518 |
16519 |
2 |
2009-10-12 17:33:59.000 |
Award 16519 |
16520 |
2 |
2009-10-12 17:35:07.000 |
Award 16520 |
39807 |
4 |
2009-10-23 01:08:26.000 |
Award 39807 |
39808 |
4 |
2009-10-23 01:09:43.000 |
Award 39808 |
39809 |
4 |
2009-10-23 01:10:48.000 |
Award 39809 |
40493 |
5 |
2009-09-23 13:42:25.000 |
Award 40493 |
40494 |
5 |
2009-09-23 13:43:18.000 |
Award 40494 |
47151 |
5 |
2009-10-14 15:19:56.000 |
Award 47151 |
47152 |
5 |
2009-10-14 15:21:03.000 |
Award 47152 |
47153 |
5 |
2009-10-14 15:22:16.000 |
Award 47153 |
47154 |
5 |
2009-10-14 15:23:59.000 |
Award 47154 |
52975 |
6 |
2009-10-01 12:37:53.000 |
Award 52975 |
52976 |
6 |
2009-10-01 12:39:18.000 |
Award 52976 |
52977 |
6 |
2009-10-01 12:40:22.000 |
Award 52977 |
52978 |
6 |
2009-10-01 12:41:44.000 |
Award 52978 |
53543 |
6 |
2009-10-03 08:01:01.000 |
Award 53543 |
53544 |
6 |
2009-10-03 08:02:08.000 |
Award 53544 |
53545 |
6 |
2009-10-03 08:03:42.000 |
Award 53545 |
53546 |
6 |
2009-10-03 08:04:32.000 |
Award 53546 |
75239 |
8 |
2009-10-08 16:51:42.000 |
Award 75239 |
2357 |
1 |
2009-09-29 11:42:22.000 |
Award 2357 |
9560 |
1 |
2009-10-22 06:57:12.000 |
Award 9560 |
9561 |
1 |
2009-10-22 06:58:27.000 |
Award 9561 |
16521 |
2 |
2009-10-12 17:36:33.000 |
Award 16521 |
39810 |
4 |
2009-10-23 01:11:55.000 |
Award 39810 |
39811 |
4 |
2009-10-23 01:12:54.000 |
Award 39811 |
39812 |
4 |
2009-10-23 01:14:25.000 |
Award 39812 |
40495 |
5 |
2009-09-23 13:44:56.000 |
Award 40495 |
40496 |
5 |
2009-09-23 13:46:06.000 |
Award 40496 |
40497 |
5 |
2009-09-23 13:47:17.000 |
Award 40497 |
47155 |
5 |
2009-10-14 15:24:54.000 |
Award 47155 |
52979 |
6 |
2009-10-01 12:42:45.000 |
Award 52979 |
53547 |
6 |
2009-10-03 08:05:23.000 |
Award 53547 |
75240 |
8 |
2009-10-08 16:52:40.000 |
Award 75240 |
75241 |
8 |
2009-10-08 16:53:52.000 |
Award 75241 |
75242 |
8 |
2009-10-08 16:55:28.000 |
Award 75242 |
75243 |
8 |
2009-10-08 16:56:20.000 |
Award 75243 |
46 rows fetched in 0.0025s (0.3807s) |
Table 'Worktable'. Scan count 4, logical reads 79, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_award'. Scan count 8, logical reads 2639, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 702 ms, elapsed time = 380 ms.
|--Parallelism(Gather Streams)
|--Merge Join(Inner Join, MERGE:([test].[20090922_frame].[t_award].[id])=([am].[id]), RESIDUAL:([test].[20090922_frame].[t_award].[id] as [am].[id]=[test].[20090922_frame].[t_award].[id]))
|--Sort(ORDER BY:([test].[20090922_frame].[t_award].[id] ASC))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090922_frame].[t_award].[id]))
| |--Hash Match(Right Semi Join, HASH:([test].[20090922_frame].[t_award].[userid], [Expr1020])=([test].[20090922_frame].[t_award].[userid], [Expr1011]), RESIDUAL:([test].[20090922_frame].[t_award].[userid]=[test].[20090922_frame].[t_award].[userid] AND [Expr1011]=[Expr1020]))
| |--Bitmap(HASH:([test].[20090922_frame].[t_award].[userid], [Expr1020]), DEFINE:([Bitmap1032]))
| | |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090922_frame].[t_award].[userid], [Expr1020]))
| | |--Compute Scalar(DEFINE:([Expr1020]=[Expr1003]+CONVERT_IMPLICIT(bigint,[Recr1007],0)))
| | |--Nested Loops(Inner Join)
| | |--Parallelism(Distribute Streams, RoundRobin Partitioning)
| | | |--Index Spool(WITH STACK)
| | | |--Concatenation
| | | |--Compute Scalar(DEFINE:([Expr1022]=(0)))
| | | | |--Constant Scan(VALUES:(((0))))
| | | |--Assert(WHERE:(CASE WHEN [Expr1024]>(100) THEN (0) ELSE NULL END))
| | | |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1024], [Recr1005]))
| | | |--Compute Scalar(DEFINE:([Expr1024]=[Expr1023]+(1)))
| | | | |--Table Spool(WITH STACK)
| | | |--Compute Scalar(DEFINE:([Expr1006]=[Recr1005]+(1)))
| | | |--Filter(WHERE:(STARTUP EXPR([Recr1005]<(4))))
| | | |--Constant Scan
| | |--Table Spool
| | |--Merge Join(Right Semi Join, MANY-TO-MANY MERGE:([test].[20090922_frame].[t_award].[userid], [Expr1015])=([test].[20090922_frame].[t_award].[userid], [Expr1021]), RESIDUAL:([test].[20090922_frame].[t_award].[userid]=[test].[20090922_frame].[t_award].[userid] AND [Expr1015]=([Expr1003]+(4)) AND [test].[20090922_frame].[t_award].[granted]<=[Expr1019]))
| | |--Sequence Project(DEFINE:([Expr1015]=row_number))
| | | |--Compute Scalar(DEFINE:([Expr1027]=(1)))
| | | |--Segment
| | | |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted]), ORDERED FORWARD)
| | |--Compute Scalar(DEFINE:([Expr1021]=[Expr1003]+(4)))
| | |--Compute Scalar(DEFINE:([Expr1019]=dateadd(minute,(5),[test].[20090922_frame].[t_award].[granted])))
| | |--Sequence Project(DEFINE:([Expr1003]=row_number))
| | |--Compute Scalar(DEFINE:([Expr1029]=(1)))
| | |--Segment
| | |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted]), ORDERED FORWARD)
| |--Parallelism(Distribute Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090922_frame].[t_award].[userid], [Expr1011]), WHERE:(PROBE([Bitmap1032])=TRUE))
| |--Sequence Project(DEFINE:([Expr1011]=row_number))
| |--Compute Scalar(DEFINE:([Expr1031]=(1)))
| |--Segment
| |--Index Scan(OBJECT:([test].[20090922_frame].[t_award].[IX_award_userid_granted]), ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([am].[id]), ORDER BY:([am].[id] ASC))
|--Clustered Index Scan(OBJECT:([test].[20090922_frame].[t_award].[PK__t_award__3C94E422] AS [am]), ORDERED FORWARD)
As we can see, the most expensive Nested Loops
with Filter
(that was used for the COUNT(*)
subquery) is replaced with more efficient Hash Match (Right Semi Join)
that processes the subquery much faster.
As a result, the query completes in only 0.38 seconds, or almost twice as fast as the original query.