SQL Server: Selecting records holding group-wise maximum at EXPLAIN EXTENDED

SQL Server: Selecting records holding group-wise maximum

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll consider SQL Server.

SQL Server has very rich SQL syntax and its optimizer is really powerful.

However, some tricks are still required to make the queries like this to run faster.

Let's create a sample table:

Table creation details

CREATE SCHEMA [20091130_groupwise]
CREATE TABLE t_distinct (
id INT NOT NULL PRIMARY KEY,
orderer INT NOT NULL,
glow INT NOT NULL,
ghigh INT NOT NULL,
stuffing VARCHAR(200) NOT NULL
)
GO
CREATE INDEX ix_distinct_glow_id ON [20091130_groupwise].t_distinct (glow, id);
CREATE INDEX ix_distinct_ghigh_id ON [20091130_groupwise].t_distinct (ghigh, id);
CREATE INDEX ix_distinct_glow_orderer_id ON [20091130_groupwise].t_distinct (glow, orderer, id);
CREATE INDEX ix_distinct_ghigh_orderer_id ON [20091130_groupwise].t_distinct (ghigh, orderer, id);
BEGIN TRANSACTION
SELECT  RAND(20091130)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 1000000
BEGIN
        INSERT
        INTO    [20091130_groupwise].t_distinct (id, orderer, glow, ghigh, stuffing)
        VALUES  (
                @cnt,
                FLOOR(RAND() * 9) + 1,
                (@cnt - 1) % 10 + 1,
                (@cnt - 1) % 10000 + 1,
                REPLICATE('*', 200)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO

As in the previous articles, this table has 1,000,000 records:

id is the PRIMARY KEY
orderer is filled with random values from 1 to 10
glow is a low cardinality grouping field (10 distinct values)
ghigh is a high cardinality grouping field (10,000 distinct values)
stuffing is an asterisk-filled VARCHAR(200) column added to emulate payload of the actual tables

Now, let's make a couple of queries.

Analytic functions

Just like PostgreSQL 8.4 that we considered in the previous article, SQL Server implements window (analytic) functions.

We can just copy the queries from the previous articles and see the results.

Unique rows (no ties)

SELECT  id, orderer, glow, ghigh
FROM    (
SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn
FROM    [20091130_groupwise].t_distinct
) q
WHERE   rn = 1

View query details

id	orderer	glow	ghigh
1	8	1	1
2	9	2	2
3	2	3	3
4	1	4	4
5	9	5	5
6	8	6	6
7	5	7	7
8	1	8	8
9	3	9	9
10	6	10	10
10 rows fetched in 0.0005s (10.3436s)

Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 42, read-ahead reads 23927, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 3593 ms,  elapsed time = 10342 ms.

  |--Parallelism(Gather Streams)
       |--Filter(WHERE:([Expr1003]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                 |--Sequence Project(DEFINE:([Expr1003]=row_number))
                      |--Compute Scalar(DEFINE:([Expr1005]=(1)))
                           |--Segment
                                |--Parallelism(Gather Streams, ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[id] ASC))
                                     |--Sort(ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[id] ASC))
                                          |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8]))

Returning all ties

SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
FROM    [20091130_groupwise].t_distinct d
) dd
WHERE   dr = 1

View query details

cnt	psum
111013	22202600
1 row fetched in 0.0000s (39.6242s)

Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 188, read-ahead reads 22053, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 6704 ms,  elapsed time = 39616 ms.

  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1007],0), [Expr1004]=CASE WHEN [globalagg1009]=(0) THEN NULL ELSE [globalagg1011] END))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=COUNT_BIG([Expr1005]), [partialagg1010]=SUM([Expr1005])))
                      |--Filter(WHERE:([Expr1002]=(1)))
                           |--Compute Scalar(DEFINE:([Expr1005]=len([test].[20091130_groupwise].[t_distinct].[stuffing] as [d].[stuffing])))
                                |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                                     |--Sequence Project(DEFINE:([Expr1002]=dense_rank))
                                          |--Segment
                                               |--Segment
                                                    |--Parallelism(Gather Streams, ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
                                                         |--Sort(ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
                                                              |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [d]))

Resolving ties by returning the `MAX(id)`

SELECT  id, orderer, glow, ghigh
FROM    (
SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY orderer, id DESC) AS rn
FROM    [20091130_groupwise].t_distinct
) dd
WHERE   rn = 1

View query details

id	orderer	glow	ghigh
999951	1	1	9951
999962	1	2	9962
999933	1	3	9933
999994	1	4	9994
999965	1	5	9965
999906	1	6	9906
999987	1	7	9987
999998	1	8	9998
999969	1	9	9969
999960	1	10	9960
10 rows fetched in 0.0005s (9.1873s)

Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 41, read-ahead reads 22122, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 4780 ms,  elapsed time = 9188 ms.

  |--Parallelism(Gather Streams)
       |--Filter(WHERE:([Expr1003]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                 |--Sequence Project(DEFINE:([Expr1003]=row_number))
                      |--Compute Scalar(DEFINE:([Expr1005]=(1)))
                           |--Segment
                                |--Parallelism(Gather Streams, ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[orderer] ASC, [test].[20091130_groupwise].[t_distinct].[id] DESC))
                                     |--Sort(ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[orderer] ASC, [test].[20091130_groupwise].[t_distinct].[id] DESC))
                                          |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8]))

Very elegant and very slow, 9 to 38 (!) seconds.

In all cases, SQL Server uses a clustered index scan. The table is too large to fit into the cache. Many physical reads (including read-ahead reads) are required by the query.

DISTINCT / CROSS APPLY

Unlike PostgreSQL and MySQL, SQL Server does not implement any native method for returning a singe record out of a group. However, SQL Server does implement a very useful clause, CROSS APPLY, which allows taking the records from the main query and applying their results to a subquery, where they can be referenced. Unlike set-based JOIN, CROSS APPLY is record-based and clearly distinguishes between the leading and the driven tables. A driven subquery can use the values from the leading query not only in the WHERE clause but also in the TOP clause, can have its own DISTINCT and ORDER BY clauses etc.

CROSS APPLY can improve the query used in our task as well. We should just take the list of the distinct groupers and cross apply the query which would return a first record for each grouper (resolving ties as necessary).

Unique rows (no ties)

SELECT  ds.*
FROM    (
SELECT  DISTINCT glow
FROM    [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = dd.glow
ORDER BY
id
) ds

View query details

id	orderer	glow	ghigh
999991	2	1	9991
999992	3	2	9992
999993	8	3	9993
999994	1	4	9994
999995	3	5	9995
999996	9	6	9996
999997	4	7	9997
999998	1	8	9998
999999	9	9	9999
1000000	6	10	10000
10 rows fetched in 0.0005s (0.2460s)

Table 't_distinct'. Scan count 11, logical reads 3160, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 234 ms,  elapsed time = 246 ms.

  |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
       |--Stream Aggregate(GROUP BY:([d].[glow]))
       |    |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
       |--Top(TOP EXPRESSION:((1)))
            |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [di]),  WHERE:([test].[20091130_groupwise].[t_distinct].[glow] as [di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]) ORDERED BACKWARD)

This is quite fast, 250 ms.

Resolving ties by returning the `MAX(id)` within the `MIN(orderer)`

Since the orderer and the id are ordered differently in this query, a single CROSS APPLY would not be able to use the index efficiently:

SELECT  ds.*
FROM    (
SELECT  DISTINCT glow
FROM    [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = dd.glow
ORDER BY
orderer, id DESC
) ds

View query details

id	orderer	glow	ghigh
999965	1	5	9965
999906	1	6	9906
999987	1	7	9987
999998	1	8	9998
999951	1	1	9951
999962	1	2	9962
999933	1	3	9933
999994	1	4	9994
999969	1	9	9969
999960	1	10	9960
10 rows fetched in 0.0006s (15.9986s)

Table 't_distinct'. Scan count 4, logical reads 33862, physical reads 42, read-ahead reads 27281, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 10, logical reads 2959407, physical reads 3114, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 6328 ms,  elapsed time = 16010 ms.

  |--Parallelism(Gather Streams)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
            |--Stream Aggregate(GROUP BY:([d].[glow]))
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([d].[glow]), ORDER BY:([d].[glow] ASC))
            |         |--Stream Aggregate(GROUP BY:([d].[glow]))
            |              |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
            |--Sort(TOP 1, ORDER BY:([di].[orderer] ASC, [di].[id] DESC))
                 |--Index Spool(SEEK:([di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]))
                      |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [di]))

This query has to sort to find the top record. Since there are too much records to sort, it's very inefficient, 16 seconds.

To benefit from the index, we should use a CROSS APPLY twice: to select the MIN(orderer) and to select the MAX(id) within this orderer:

SELECT  ds.*
FROM    (
SELECT  DISTINCT glow
FROM    [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 dss.*
FROM    (
SELECT  TOP 1 orderer
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = dd.glow
) ddd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct dii
WHERE   dii.glow = dd.glow
AND dii.orderer = ddd.orderer
ORDER BY
id DESC
) dss
) ds

View query details

id	orderer	glow	ghigh
999951	1	1	9951
999962	1	2	9962
999933	1	3	9933
999994	1	4	9994
999965	1	5	9965
999906	1	6	9906
999987	1	7	9987
999998	1	8	9998
999969	1	9	9969
999960	1	10	9960
10 rows fetched in 0.0005s (0.2416s)

Table 't_distinct'. Scan count 21, logical reads 3228, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 235 ms,  elapsed time = 241 ms.

  |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
       |--Stream Aggregate(GROUP BY:([d].[glow]))
       |    |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
       |--Top(TOP EXPRESSION:((1)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[orderer]))
                 |--Top(TOP EXPRESSION:((1)))
                 |    |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]) ORDERED FORWARD)
                 |--Top(TOP EXPRESSION:((1)))
                      |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [dii]),  WHERE:([test].[20091130_groupwise].[t_distinct].[glow] as [dii].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow] AND [test].[20091130_groupwise].[t_distinct].[orderer] as [dii].[orderer]=[test].[20091130_groupwise].[t_distinct].[orderer] as [di].[orderer]) ORDERED BACKWARD)

Now, this query is efficient again, only 250 ms.

But can we improve the query yet a little?

Recursive CTE's

Just like PostgreSQL 8.4, SQL Server 2005 uses recursive CTE's. Unfortunately, it does not use a truly set-based approach to implement them. That's why it's impossible to use a LIMIT / TOP clause to build the list of distinct groupers like we did in PostgreSQL 8.4.

However, SQL Server 2005 allows using window function in the recursive CTE's, and its optimizer is even smart enough to build an efficient plan for them that would use the TOP access path over the index.

The idea is basically the same as what we did in PostgreSQL:

Find the lowest grouper in the anchor part of the recursive CTE
Iteratively find the next groupers in the recursive part

The only difference is that we will have to use ROW_NUMBER instead of TOP.

This will jump over the index and find the distinct groupers in no time (of course only if there are not many of them).

Here are the queries:

Unique rows (no ties)

WITH    rows AS
(
SELECT  MIN(glow) AS glow
FROM    [20091130_groupwise].t_distinct
UNION ALL
SELECT  glow
FROM    (
SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM    rows r
JOIN    [20091130_groupwise].t_distinct di
ON      di.glow &gt; r.glow
WHERE   r.glow IS NOT NULL
) q
WHERE   rn = 1
)
SELECT  ds.*
FROM    rows r
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct d
WHERE   d.glow = r.glow
ORDER BY
id DESC
) ds
OPTION (MAXRECURSION 0)

View query details

id	orderer	glow	ghigh
999991	2	1	9991
999992	3	2	9992
999993	8	3	9993
999994	1	4	9994
999995	3	5	9995
999996	9	6	9996
999997	4	7	9997
999998	1	8	9998
999999	9	9	9999
1000000	6	10	10000
10 rows fetched in 0.0005s (0.0031s)

Table 'Worktable'. Scan count 22, logical reads 142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_distinct'. Scan count 21, logical reads 133, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 3 ms.

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
       |--Index Spool(WITH STACK)
       |    |--Concatenation
       |         |--Compute Scalar(DEFINE:([Expr1015]=(0)))
       |         |    |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091130_groupwise].[t_distinct].[glow])))
       |         |         |--Top(TOP EXPRESSION:((1)))
       |         |              |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
       |         |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1017], [Recr1004]))
       |              |--Compute Scalar(DEFINE:([Expr1017]=[Expr1016]+(1)))
       |              |    |--Table Spool(WITH STACK)
       |              |--Filter(WHERE:([Expr1007]=(1)))
       |                   |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)<(0) THEN (0) ELSE (1) END))
       |                        |--Sequence Project(DEFINE:([Expr1007]=row_number))
       |                             |--Compute Scalar(DEFINE:([Expr1014]=(1)))
       |                                  |--Segment
       |                                       |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
       |                                            |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] > [Recr1004]) ORDERED FORWARD)
       |--Index Spool(SEEK:([Recr1008]=[Recr1008]))
            |--Top(TOP EXPRESSION:((1)))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[id], [Expr1019]) WITH ORDERED PREFETCH)
                      |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), SEEK:([d].[glow]=[Recr1008]) ORDERED BACKWARD)
                      |--Clustered Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [d]), SEEK:([d].[id]=[test].[20091130_groupwise].[t_distinct].[id] as [d].[id]) LOOKUP ORDERED FORWARD)

Resolving ties by returning the `MAX(id)` within the `MIN(orderer)`

WITH    rows AS
(
SELECT  MIN(glow) AS glow
FROM    [20091130_groupwise].t_distinct
UNION ALL
SELECT  glow
FROM    (
SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM    rows r
JOIN    [20091130_groupwise].t_distinct di
ON      di.glow &gt; r.glow
WHERE   r.glow IS NOT NULL
) q
WHERE   rn = 1
)
SELECT  ds.*
FROM    rows r
CROSS APPLY
(
SELECT  TOP 1 dss.*
FROM    (
SELECT  TOP 1 orderer
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = r.glow
) ddd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct dii
WHERE   dii.glow = r.glow
AND dii.orderer = ddd.orderer
ORDER BY
id DESC
) dss
) ds

View query details

id	orderer	glow	ghigh
999951	1	1	9951
999962	1	2	9962
999933	1	3	9933
999994	1	4	9994
999965	1	5	9965
999906	1	6	9906
999987	1	7	9987
999998	1	8	9998
999969	1	9	9969
999960	1	10	9960
10 rows fetched in 0.0005s (0.0032s)

Table 'Worktable'. Scan count 22, logical reads 142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_distinct'. Scan count 31, logical reads 172, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 3 ms.

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
       |--Index Spool(WITH STACK)
       |    |--Concatenation
       |         |--Compute Scalar(DEFINE:([Expr1017]=(0)))
       |         |    |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091130_groupwise].[t_distinct].[glow])))
       |         |         |--Top(TOP EXPRESSION:((1)))
       |         |              |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
       |         |--Assert(WHERE:(CASE WHEN [Expr1019]>(100) THEN (0) ELSE NULL END))
       |              |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1019], [Recr1004]))
       |                   |--Compute Scalar(DEFINE:([Expr1019]=[Expr1018]+(1)))
       |                   |    |--Table Spool(WITH STACK)
       |                   |--Filter(WHERE:([Expr1007]=(1)))
       |                        |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)<(0) THEN (0) ELSE (1) END))
       |                             |--Sequence Project(DEFINE:([Expr1007]=row_number))
       |                                  |--Compute Scalar(DEFINE:([Expr1016]=(1)))
       |                                       |--Segment
       |                                            |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
       |                                                 |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] > [Recr1004]) ORDERED FORWARD)
       |--Index Spool(SEEK:([Recr1008]=[Recr1008]))
            |--Top(TOP EXPRESSION:((1)))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[orderer]))
                      |--Top(TOP EXPRESSION:((1)))
                      |    |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[Recr1008]) ORDERED FORWARD)
                      |--Top(TOP EXPRESSION:((1)))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([dii].[id], [Expr1021]) WITH ORDERED PREFETCH)
                                |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [dii]), SEEK:([dii].[glow]=[Recr1008] AND [dii].[orderer]=[test].[20091130_groupwise].[t_distinct].[orderer] as [di].[orderer]) ORDERED BACKWARD)
                                |--Clustered Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [dii]), SEEK:([dii].[id]=[test].[20091130_groupwise].[t_distinct].[id] as [dii].[id]) LOOKUP ORDERED FORWARD)

Both these queries are very complex and scary to look at.

However, they complete in only 3 ms. This is 3,000 (!) times as efficient as a plain ROW_NUMBER query, though the latter is much more simple to understand.

Summary

To select records holding group-wise maximums in SQL Server, the following approaches can be used:

Analytic functions
Using DISTINCT / CROSS APPLY / TOP
Using recursive CTE's / CROSS APPLY / TOP

These methods are arranged in order of increasing complexity and decreasing cost (for low-cardinality groupers).

In there are few groupers in the table, the index scan (used by DISTINCT) or a loose index scan (emulated by the recursive CTE's) help to build a list of distinct groupers much faster and the CROSS APPLY then uses the secondary indexes to select the first record within each group.

To be continued.

Written by Quassnoi

November 30th, 2009 at 11:00 pm

Posted in SQL Server

One Response to 'SQL Server: Selecting records holding group-wise maximum'

Subscribe to comments with RSS

you are a lifesaver. Thanks a lot.

roshan

26 Feb 13 at 18:02

EXPLAIN EXTENDED