Continuing the series on selecting records holding group-wise maximums:
How do I select the whole records, grouped on grouper
and holding a group-wise maximum (or minimum) on other column?
In this article, I'll consider SQL Server.
SQL Server has very rich SQL syntax and its optimizer is really powerful.
However, some tricks are still required to make the queries like this to run faster.
Let's create a sample table:
Table creation details
CREATE SCHEMA [20091130_groupwise]
CREATE TABLE t_distinct (
id INT NOT NULL PRIMARY KEY,
orderer INT NOT NULL,
glow INT NOT NULL,
ghigh INT NOT NULL,
stuffing VARCHAR(200) NOT NULL
)
GO
CREATE INDEX ix_distinct_glow_id ON [20091130_groupwise].t_distinct (glow, id);
CREATE INDEX ix_distinct_ghigh_id ON [20091130_groupwise].t_distinct (ghigh, id);
CREATE INDEX ix_distinct_glow_orderer_id ON [20091130_groupwise].t_distinct (glow, orderer, id);
CREATE INDEX ix_distinct_ghigh_orderer_id ON [20091130_groupwise].t_distinct (ghigh, orderer, id);
BEGIN TRANSACTION
SELECT RAND(20091130)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
INSERT
INTO [20091130_groupwise].t_distinct (id, orderer, glow, ghigh, stuffing)
VALUES (
@cnt,
FLOOR(RAND() * 9) + 1,
(@cnt - 1) % 10 + 1,
(@cnt - 1) % 10000 + 1,
REPLICATE('*', 200)
)
SET @cnt = @cnt + 1
END
COMMIT
GO
As in the previous articles, this table has 1,000,000 records:
id
is the PRIMARY KEY
orderer
is filled with random values from 1 to 10
glow
is a low cardinality grouping field (10 distinct values)
ghigh
is a high cardinality grouping field (10,000 distinct values)
stuffing
is an asterisk-filled VARCHAR(200)
column added to emulate payload of the actual tables
Now, let's make a couple of queries.
Analytic functions
Just like PostgreSQL 8.4 that we considered in the previous article, SQL Server implements window (analytic) functions.
We can just copy the queries from the previous articles and see the results.
Unique rows (no ties)
SELECT id, orderer, glow, ghigh
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn
FROM [20091130_groupwise].t_distinct
) q
WHERE rn = 1
View query details
id |
orderer |
glow |
ghigh |
1 |
8 |
1 |
1 |
2 |
9 |
2 |
2 |
3 |
2 |
3 |
3 |
4 |
1 |
4 |
4 |
5 |
9 |
5 |
5 |
6 |
8 |
6 |
6 |
7 |
5 |
7 |
7 |
8 |
1 |
8 |
8 |
9 |
3 |
9 |
9 |
10 |
6 |
10 |
10 |
10 rows fetched in 0.0005s (10.3436s) |
Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 42, read-ahead reads 23927, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 3593 ms, elapsed time = 10342 ms.
|--Parallelism(Gather Streams)
|--Filter(WHERE:([Expr1003]=(1)))
|--Parallelism(Distribute Streams, RoundRobin Partitioning)
|--Sequence Project(DEFINE:([Expr1003]=row_number))
|--Compute Scalar(DEFINE:([Expr1005]=(1)))
|--Segment
|--Parallelism(Gather Streams, ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[id] ASC))
|--Sort(ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[id] ASC))
|--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8]))
Returning all ties
SELECT COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM (
SELECT d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
FROM [20091130_groupwise].t_distinct d
) dd
WHERE dr = 1
View query details
cnt |
psum |
111013 |
22202600 |
1 row fetched in 0.0000s (39.6242s) |
Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 188, read-ahead reads 22053, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6704 ms, elapsed time = 39616 ms.
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1007],0), [Expr1004]=CASE WHEN [globalagg1009]=(0) THEN NULL ELSE [globalagg1011] END))
|--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
|--Parallelism(Gather Streams)
|--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=COUNT_BIG([Expr1005]), [partialagg1010]=SUM([Expr1005])))
|--Filter(WHERE:([Expr1002]=(1)))
|--Compute Scalar(DEFINE:([Expr1005]=len([test].[20091130_groupwise].[t_distinct].[stuffing] as [d].[stuffing])))
|--Parallelism(Distribute Streams, RoundRobin Partitioning)
|--Sequence Project(DEFINE:([Expr1002]=dense_rank))
|--Segment
|--Segment
|--Parallelism(Gather Streams, ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
|--Sort(ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
|--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [d]))
Resolving ties by returning the MAX(id)
SELECT id, orderer, glow, ghigh
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY orderer, id DESC) AS rn
FROM [20091130_groupwise].t_distinct
) dd
WHERE rn = 1
View query details
id |
orderer |
glow |
ghigh |
999951 |
1 |
1 |
9951 |
999962 |
1 |
2 |
9962 |
999933 |
1 |
3 |
9933 |
999994 |
1 |
4 |
9994 |
999965 |
1 |
5 |
9965 |
999906 |
1 |
6 |
9906 |
999987 |
1 |
7 |
9987 |
999998 |
1 |
8 |
9998 |
999969 |
1 |
9 |
9969 |
999960 |
1 |
10 |
9960 |
10 rows fetched in 0.0005s (9.1873s) |
Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 41, read-ahead reads 22122, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 4780 ms, elapsed time = 9188 ms.
|--Parallelism(Gather Streams)
|--Filter(WHERE:([Expr1003]=(1)))
|--Parallelism(Distribute Streams, RoundRobin Partitioning)
|--Sequence Project(DEFINE:([Expr1003]=row_number))
|--Compute Scalar(DEFINE:([Expr1005]=(1)))
|--Segment
|--Parallelism(Gather Streams, ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[orderer] ASC, [test].[20091130_groupwise].[t_distinct].[id] DESC))
|--Sort(ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[orderer] ASC, [test].[20091130_groupwise].[t_distinct].[id] DESC))
|--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8]))
Very elegant and very slow, 9 to 38 (!) seconds.
In all cases, SQL Server uses a clustered index scan. The table is too large to fit into the cache. Many physical reads (including read-ahead reads) are required by the query.
DISTINCT / CROSS APPLY
Unlike PostgreSQL and MySQL, SQL Server does not implement any native method for returning a singe record out of a group. However, SQL Server does implement a very useful clause, CROSS APPLY
, which allows taking the records from the main query and applying their results to a subquery, where they can be referenced. Unlike set-based JOIN
, CROSS APPLY
is record-based and clearly distinguishes between the leading and the driven tables. A driven subquery can use the values from the leading query not only in the WHERE
clause but also in the TOP
clause, can have its own DISTINCT
and ORDER BY
clauses etc.
CROSS APPLY
can improve the query used in our task as well. We should just take the list of the distinct groupers and cross apply the query which would return a first record for each grouper (resolving ties as necessary).
Unique rows (no ties)
SELECT ds.*
FROM (
SELECT DISTINCT glow
FROM [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT TOP 1 id, orderer, glow, ghigh
FROM [20091130_groupwise].t_distinct di
WHERE di.glow = dd.glow
ORDER BY
id
) ds
View query details
id |
orderer |
glow |
ghigh |
999991 |
2 |
1 |
9991 |
999992 |
3 |
2 |
9992 |
999993 |
8 |
3 |
9993 |
999994 |
1 |
4 |
9994 |
999995 |
3 |
5 |
9995 |
999996 |
9 |
6 |
9996 |
999997 |
4 |
7 |
9997 |
999998 |
1 |
8 |
9998 |
999999 |
9 |
9 |
9999 |
1000000 |
6 |
10 |
10000 |
10 rows fetched in 0.0005s (0.2460s) |
Table 't_distinct'. Scan count 11, logical reads 3160, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 234 ms, elapsed time = 246 ms.
|--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
|--Stream Aggregate(GROUP BY:([d].[glow]))
| |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
|--Top(TOP EXPRESSION:((1)))
|--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [di]), WHERE:([test].[20091130_groupwise].[t_distinct].[glow] as [di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]) ORDERED BACKWARD)
This is quite fast, 250 ms.
Resolving ties by returning the MAX(id)
within the MIN(orderer)
Since the orderer
and the id
are ordered differently in this query, a single CROSS APPLY
would not be able to use the index efficiently:
SELECT ds.*
FROM (
SELECT DISTINCT glow
FROM [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT TOP 1 id, orderer, glow, ghigh
FROM [20091130_groupwise].t_distinct di
WHERE di.glow = dd.glow
ORDER BY
orderer, id DESC
) ds
View query details
id |
orderer |
glow |
ghigh |
999965 |
1 |
5 |
9965 |
999906 |
1 |
6 |
9906 |
999987 |
1 |
7 |
9987 |
999998 |
1 |
8 |
9998 |
999951 |
1 |
1 |
9951 |
999962 |
1 |
2 |
9962 |
999933 |
1 |
3 |
9933 |
999994 |
1 |
4 |
9994 |
999969 |
1 |
9 |
9969 |
999960 |
1 |
10 |
9960 |
10 rows fetched in 0.0006s (15.9986s) |
Table 't_distinct'. Scan count 4, logical reads 33862, physical reads 42, read-ahead reads 27281, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 10, logical reads 2959407, physical reads 3114, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6328 ms, elapsed time = 16010 ms.
|--Parallelism(Gather Streams)
|--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
|--Stream Aggregate(GROUP BY:([d].[glow]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([d].[glow]), ORDER BY:([d].[glow] ASC))
| |--Stream Aggregate(GROUP BY:([d].[glow]))
| |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
|--Sort(TOP 1, ORDER BY:([di].[orderer] ASC, [di].[id] DESC))
|--Index Spool(SEEK:([di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]))
|--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [di]))
This query has to sort to find the top record. Since there are too much records to sort, it's very inefficient, 16 seconds.
To benefit from the index, we should use a CROSS APPLY
twice: to select the MIN(orderer)
and to select the MAX(id)
within this orderer
:
SELECT ds.*
FROM (
SELECT DISTINCT glow
FROM [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT TOP 1 dss.*
FROM (
SELECT TOP 1 orderer
FROM [20091130_groupwise].t_distinct di
WHERE di.glow = dd.glow
) ddd
CROSS APPLY
(
SELECT TOP 1 id, orderer, glow, ghigh
FROM [20091130_groupwise].t_distinct dii
WHERE dii.glow = dd.glow
AND dii.orderer = ddd.orderer
ORDER BY
id DESC
) dss
) ds
View query details
id |
orderer |
glow |
ghigh |
999951 |
1 |
1 |
9951 |
999962 |
1 |
2 |
9962 |
999933 |
1 |
3 |
9933 |
999994 |
1 |
4 |
9994 |
999965 |
1 |
5 |
9965 |
999906 |
1 |
6 |
9906 |
999987 |
1 |
7 |
9987 |
999998 |
1 |
8 |
9998 |
999969 |
1 |
9 |
9969 |
999960 |
1 |
10 |
9960 |
10 rows fetched in 0.0005s (0.2416s) |
Table 't_distinct'. Scan count 21, logical reads 3228, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 235 ms, elapsed time = 241 ms.
|--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
|--Stream Aggregate(GROUP BY:([d].[glow]))
| |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
|--Top(TOP EXPRESSION:((1)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([di].[orderer]))
|--Top(TOP EXPRESSION:((1)))
| |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]) ORDERED FORWARD)
|--Top(TOP EXPRESSION:((1)))
|--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [dii]), WHERE:([test].[20091130_groupwise].[t_distinct].[glow] as [dii].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow] AND [test].[20091130_groupwise].[t_distinct].[orderer] as [dii].[orderer]=[test].[20091130_groupwise].[t_distinct].[orderer] as [di].[orderer]) ORDERED BACKWARD)
Now, this query is efficient again, only 250 ms.
But can we improve the query yet a little?
Recursive CTE's
Just like PostgreSQL 8.4, SQL Server 2005 uses recursive CTE's. Unfortunately, it does not use a truly set-based approach to implement them. That's why it's impossible to use a LIMIT / TOP clause to build the list of distinct groupers like we did in PostgreSQL 8.4.
However, SQL Server 2005 allows using window function in the recursive CTE's, and its optimizer is even smart enough to build an efficient plan for them that would use the TOP
access path over the index.
The idea is basically the same as what we did in PostgreSQL:
- Find the lowest grouper in the anchor part of the recursive CTE
- Iteratively find the next groupers in the recursive part
The only difference is that we will have to use ROW_NUMBER
instead of TOP
.
This will jump over the index and find the distinct groupers in no time (of course only if there are not many of them).
Here are the queries:
Unique rows (no ties)
WITH rows AS
(
SELECT MIN(glow) AS glow
FROM [20091130_groupwise].t_distinct
UNION ALL
SELECT glow
FROM (
SELECT di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM rows r
JOIN [20091130_groupwise].t_distinct di
ON di.glow > r.glow
WHERE r.glow IS NOT NULL
) q
WHERE rn = 1
)
SELECT ds.*
FROM rows r
CROSS APPLY
(
SELECT TOP 1 id, orderer, glow, ghigh
FROM [20091130_groupwise].t_distinct d
WHERE d.glow = r.glow
ORDER BY
id DESC
) ds
OPTION (MAXRECURSION 0)
View query details
id |
orderer |
glow |
ghigh |
999991 |
2 |
1 |
9991 |
999992 |
3 |
2 |
9992 |
999993 |
8 |
3 |
9993 |
999994 |
1 |
4 |
9994 |
999995 |
3 |
5 |
9995 |
999996 |
9 |
6 |
9996 |
999997 |
4 |
7 |
9997 |
999998 |
1 |
8 |
9998 |
999999 |
9 |
9 |
9999 |
1000000 |
6 |
10 |
10000 |
10 rows fetched in 0.0005s (0.0031s) |
Table 'Worktable'. Scan count 22, logical reads 142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_distinct'. Scan count 21, logical reads 133, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 3 ms.
|--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
|--Index Spool(WITH STACK)
| |--Concatenation
| |--Compute Scalar(DEFINE:([Expr1015]=(0)))
| | |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091130_groupwise].[t_distinct].[glow])))
| | |--Top(TOP EXPRESSION:((1)))
| | |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
| |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1017], [Recr1004]))
| |--Compute Scalar(DEFINE:([Expr1017]=[Expr1016]+(1)))
| | |--Table Spool(WITH STACK)
| |--Filter(WHERE:([Expr1007]=(1)))
| |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)<(0) THEN (0) ELSE (1) END))
| |--Sequence Project(DEFINE:([Expr1007]=row_number))
| |--Compute Scalar(DEFINE:([Expr1014]=(1)))
| |--Segment
| |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
| |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] > [Recr1004]) ORDERED FORWARD)
|--Index Spool(SEEK:([Recr1008]=[Recr1008]))
|--Top(TOP EXPRESSION:((1)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([d].[id], [Expr1019]) WITH ORDERED PREFETCH)
|--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), SEEK:([d].[glow]=[Recr1008]) ORDERED BACKWARD)
|--Clustered Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [d]), SEEK:([d].[id]=[test].[20091130_groupwise].[t_distinct].[id] as [d].[id]) LOOKUP ORDERED FORWARD)
Resolving ties by returning the MAX(id)
within the MIN(orderer)
WITH rows AS
(
SELECT MIN(glow) AS glow
FROM [20091130_groupwise].t_distinct
UNION ALL
SELECT glow
FROM (
SELECT di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM rows r
JOIN [20091130_groupwise].t_distinct di
ON di.glow > r.glow
WHERE r.glow IS NOT NULL
) q
WHERE rn = 1
)
SELECT ds.*
FROM rows r
CROSS APPLY
(
SELECT TOP 1 dss.*
FROM (
SELECT TOP 1 orderer
FROM [20091130_groupwise].t_distinct di
WHERE di.glow = r.glow
) ddd
CROSS APPLY
(
SELECT TOP 1 id, orderer, glow, ghigh
FROM [20091130_groupwise].t_distinct dii
WHERE dii.glow = r.glow
AND dii.orderer = ddd.orderer
ORDER BY
id DESC
) dss
) ds
View query details
id |
orderer |
glow |
ghigh |
999951 |
1 |
1 |
9951 |
999962 |
1 |
2 |
9962 |
999933 |
1 |
3 |
9933 |
999994 |
1 |
4 |
9994 |
999965 |
1 |
5 |
9965 |
999906 |
1 |
6 |
9906 |
999987 |
1 |
7 |
9987 |
999998 |
1 |
8 |
9998 |
999969 |
1 |
9 |
9969 |
999960 |
1 |
10 |
9960 |
10 rows fetched in 0.0005s (0.0032s) |
Table 'Worktable'. Scan count 22, logical reads 142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_distinct'. Scan count 31, logical reads 172, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 3 ms.
|--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
|--Index Spool(WITH STACK)
| |--Concatenation
| |--Compute Scalar(DEFINE:([Expr1017]=(0)))
| | |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091130_groupwise].[t_distinct].[glow])))
| | |--Top(TOP EXPRESSION:((1)))
| | |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
| |--Assert(WHERE:(CASE WHEN [Expr1019]>(100) THEN (0) ELSE NULL END))
| |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1019], [Recr1004]))
| |--Compute Scalar(DEFINE:([Expr1019]=[Expr1018]+(1)))
| | |--Table Spool(WITH STACK)
| |--Filter(WHERE:([Expr1007]=(1)))
| |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)<(0) THEN (0) ELSE (1) END))
| |--Sequence Project(DEFINE:([Expr1007]=row_number))
| |--Compute Scalar(DEFINE:([Expr1016]=(1)))
| |--Segment
| |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
| |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] > [Recr1004]) ORDERED FORWARD)
|--Index Spool(SEEK:([Recr1008]=[Recr1008]))
|--Top(TOP EXPRESSION:((1)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([di].[orderer]))
|--Top(TOP EXPRESSION:((1)))
| |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[Recr1008]) ORDERED FORWARD)
|--Top(TOP EXPRESSION:((1)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([dii].[id], [Expr1021]) WITH ORDERED PREFETCH)
|--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [dii]), SEEK:([dii].[glow]=[Recr1008] AND [dii].[orderer]=[test].[20091130_groupwise].[t_distinct].[orderer] as [di].[orderer]) ORDERED BACKWARD)
|--Clustered Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [dii]), SEEK:([dii].[id]=[test].[20091130_groupwise].[t_distinct].[id] as [dii].[id]) LOOKUP ORDERED FORWARD)
Both these queries are very complex and scary to look at.
However, they complete in only 3 ms. This is 3,000 (!) times as efficient as a plain ROW_NUMBER
query, though the latter is much more simple to understand.
Summary
To select records holding group-wise maximums in SQL Server, the following approaches can be used:
- Analytic functions
- Using
DISTINCT
/ CROSS APPLY
/ TOP
- Using recursive CTE's /
CROSS APPLY
/ TOP
These methods are arranged in order of increasing complexity and decreasing cost (for low-cardinality groupers).
In there are few groupers in the table, the index scan (used by DISTINCT
) or a loose index scan (emulated by the recursive CTE's) help to build a list of distinct groupers much faster and the CROSS APPLY
then uses the secondary indexes to select the first record within each group.
To be continued.
you are a lifesaver. Thanks a lot.
roshan
26 Feb 13 at 18:02