EXPLAIN EXTENDED

How to create fast database queries

SQL Server: Selecting records holding group-wise maximum

with one comment

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll consider SQL Server.

SQL Server has very rich SQL syntax and its optimizer is really powerful.

However, some tricks are still required to make the queries like this to run faster.

Let's create a sample table:

Table creation details

As in the previous articles, this table has 1,000,000 records:

  • id is the PRIMARY KEY
  • orderer is filled with random values from 1 to 10
  • glow is a low cardinality grouping field (10 distinct values)
  • ghigh is a high cardinality grouping field (10,000 distinct values)
  • stuffing is an asterisk-filled VARCHAR(200) column added to emulate payload of the actual tables

Now, let's make a couple of queries.

Analytic functions

Just like PostgreSQL 8.4 that we considered in the previous article, SQL Server implements window (analytic) functions.

We can just copy the queries from the previous articles and see the results.

Unique rows (no ties)

SELECT  id, orderer, glow, ghigh
FROM    (
SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn
FROM    [20091130_groupwise].t_distinct
) q
WHERE   rn = 1

View query details

Returning all ties

SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
FROM    [20091130_groupwise].t_distinct d
) dd
WHERE   dr = 1

View query details

Resolving ties by returning the MAX(id)

SELECT  id, orderer, glow, ghigh
FROM    (
SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY orderer, id DESC) AS rn
FROM    [20091130_groupwise].t_distinct
) dd
WHERE   rn = 1

View query details

Very elegant and very slow, 9 to 38 (!) seconds.

In all cases, SQL Server uses a clustered index scan. The table is too large to fit into the cache. Many physical reads (including read-ahead reads) are required by the query.

DISTINCT / CROSS APPLY

Unlike PostgreSQL and MySQL, SQL Server does not implement any native method for returning a singe record out of a group. However, SQL Server does implement a very useful clause, CROSS APPLY, which allows taking the records from the main query and applying their results to a subquery, where they can be referenced. Unlike set-based JOIN, CROSS APPLY is record-based and clearly distinguishes between the leading and the driven tables. A driven subquery can use the values from the leading query not only in the WHERE clause but also in the TOP clause, can have its own DISTINCT and ORDER BY clauses etc.

CROSS APPLY can improve the query used in our task as well. We should just take the list of the distinct groupers and cross apply the query which would return a first record for each grouper (resolving ties as necessary).

Unique rows (no ties)

SELECT  ds.*
FROM    (
SELECT  DISTINCT glow
FROM    [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = dd.glow
ORDER BY
id
) ds

View query details

This is quite fast, 250 ms.

Resolving ties by returning the MAX(id) within the MIN(orderer)

Since the orderer and the id are ordered differently in this query, a single CROSS APPLY would not be able to use the index efficiently:

SELECT  ds.*
FROM    (
SELECT  DISTINCT glow
FROM    [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = dd.glow
ORDER BY
orderer, id DESC
) ds

View query details

This query has to sort to find the top record. Since there are too much records to sort, it's very inefficient, 16 seconds.

To benefit from the index, we should use a CROSS APPLY twice: to select the MIN(orderer) and to select the MAX(id) within this orderer:

SELECT  ds.*
FROM    (
SELECT  DISTINCT glow
FROM    [20091130_groupwise].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 dss.*
FROM    (
SELECT  TOP 1 orderer
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = dd.glow
) ddd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct dii
WHERE   dii.glow = dd.glow
AND dii.orderer = ddd.orderer
ORDER BY
id DESC
) dss
) ds

View query details

Now, this query is efficient again, only 250 ms.

But can we improve the query yet a little?

Recursive CTE's

Just like PostgreSQL 8.4, SQL Server 2005 uses recursive CTE's. Unfortunately, it does not use a truly set-based approach to implement them. That's why it's impossible to use a LIMIT / TOP clause to build the list of distinct groupers like we did in PostgreSQL 8.4.

However, SQL Server 2005 allows using window function in the recursive CTE's, and its optimizer is even smart enough to build an efficient plan for them that would use the TOP access path over the index.

The idea is basically the same as what we did in PostgreSQL:

  1. Find the lowest grouper in the anchor part of the recursive CTE
  2. Iteratively find the next groupers in the recursive part

The only difference is that we will have to use ROW_NUMBER instead of TOP.

This will jump over the index and find the distinct groupers in no time (of course only if there are not many of them).

Here are the queries:

Unique rows (no ties)

WITH    rows AS
(
SELECT  MIN(glow) AS glow
FROM    [20091130_groupwise].t_distinct
UNION ALL
SELECT  glow
FROM    (
SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM    rows r
JOIN    [20091130_groupwise].t_distinct di
ON      di.glow > r.glow
WHERE   r.glow IS NOT NULL
) q
WHERE   rn = 1
)
SELECT  ds.*
FROM    rows r
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct d
WHERE   d.glow = r.glow
ORDER BY
id DESC
) ds
OPTION (MAXRECURSION 0)

View query details

Resolving ties by returning the MAX(id) within the MIN(orderer)

WITH    rows AS
(
SELECT  MIN(glow) AS glow
FROM    [20091130_groupwise].t_distinct
UNION ALL
SELECT  glow
FROM    (
SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM    rows r
JOIN    [20091130_groupwise].t_distinct di
ON      di.glow > r.glow
WHERE   r.glow IS NOT NULL
) q
WHERE   rn = 1
)
SELECT  ds.*
FROM    rows r
CROSS APPLY
(
SELECT  TOP 1 dss.*
FROM    (
SELECT  TOP 1 orderer
FROM    [20091130_groupwise].t_distinct di
WHERE   di.glow = r.glow
) ddd
CROSS APPLY
(
SELECT  TOP 1 id, orderer, glow, ghigh
FROM    [20091130_groupwise].t_distinct dii
WHERE   dii.glow = r.glow
AND dii.orderer = ddd.orderer
ORDER BY
id DESC
) dss
) ds

View query details

Both these queries are very complex and scary to look at.

However, they complete in only 3 ms. This is 3,000 (!) times as efficient as a plain ROW_NUMBER query, though the latter is much more simple to understand.

Summary

To select records holding group-wise maximums in SQL Server, the following approaches can be used:

  • Analytic functions
  • Using DISTINCT / CROSS APPLY / TOP
  • Using recursive CTE's / CROSS APPLY / TOP

These methods are arranged in order of increasing complexity and decreasing cost (for low-cardinality groupers).

In there are few groupers in the table, the index scan (used by DISTINCT) or a loose index scan (emulated by the recursive CTE's) help to build a list of distinct groupers much faster and the CROSS APPLY then uses the secondary indexes to select the first record within each group.

To be continued.

Written by Quassnoi

November 30th, 2009 at 11:00 pm

Posted in SQL Server

One Response to 'SQL Server: Selecting records holding group-wise maximum'

Subscribe to comments with RSS

  1. you are a lifesaver. Thanks a lot.

    roshan

    26 Feb 13 at 18:02

Leave a Reply