EXPLAIN EXTENDED

How to create fast database queries

SQL Server: Selecting records holding group-wise maximum (with ties)

Comments enabled. I *really* need your comment

A feedback on the yesterday's article on selecting records holding group-wise maximums in SQL Server.

Gonçalo asks:

Regarding the post I mention on the subject, wouldn't it be easier to obtain the result you're after using the SQL Server specific SELECT TOP x WITH TIES?

This clause (which is indeed specific to SQL Server) returns ties. TOP x WITH TIES means return TOP x records, plus all records holding the same values of the ORDER BY expressions as the last (x'th) record returned by the TOP.

With a little effort this can be used for one of the queries I wrote about yesterday, namely the one that returns all records holding MIN(orderer) within each group.

However, the efficiency of this solution varies greatly depending on the cardinality of both ordering and grouping columns.

Let's create a sample table:

Table creation details

Just like in the articles before that this table contains 1,000,000 records with the following fields:

  • id is the PRIMARY KEY
  • orderer is filled with random values from 1 to 10
  • glow is a low cardinality grouping field (10 distinct values)
  • ghigh is a high cardinality grouping field (10,000 distinct values)
  • stuffing is an asterisk-filled VARCHAR(200) column added to emulate payload of the actual tables

Additionally, there is one more field: lorderer. Just like orderer, it is filled with random values, but has much greater cardinality: the values are 1 to 10,000.

Now, let's try to select the records holding MIN(orderer) using the two approaches.

Low cardinality orderers


Using DENSE_RANK

SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
FROM    [20091201_ties].t_distinct d
) dd
WHERE   dr = 1

View query details

As in the yesterday's table, this query runs for 41 second. The plan used by the engine involves partitioning, sorting and gathering the streams, but the real problem is that the table is large and a fullscan is used. Just too much data to be read.

Now, let's try the same query using CROSS APPLY / TOP … WITH TIES.

Using CROSS APPLY / TOP … WITH TIES

SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
SELECT  DISTINCT glow
FROM    [20091201_ties].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 WITH TIES
stuffing
FROM    [20091201_ties].t_distinct di
WHERE   di.glow = dd.glow
ORDER BY
orderer
) ds

View query details

This query runs for… um, 117 seconds. Thrice as slow.

Again: there are too many records. The TOP uses the index to return first rows, which is normally fast but not when there are 10,000 records to return. Using the index implies the need to lookup the values of stuffing (which we use in this query) in the actual table. You can see it in the plan above as a Nested Loops branch which joins the index (Index Seek) and the actual table (Clustered Index Seek). The overhead required for joining the index and the table by far outweighs all benefits of having preordered data.

However, this is only true for the orderers with low cardinality.

High cardinality orderers


Let's try the same approaches but this time let's use them to return the records holding MIN(lorderer). This field has much higher cardinality and there are few records per group that hold the MIN(lorderer) value.

Using DENSE_RANK

SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY lorderer) AS dr
FROM    [20091201_ties].t_distinct d
) dd
WHERE   dr = 1

View query details

40 seconds.

The performance is just as poor. Since we have exactly same full table scan with the exactly same million rows to sort, it's quite predictable.

Using CROSS APPLY / TOP … WITH TIES

SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
SELECT  DISTINCT glow
FROM    [20091201_ties].t_distinct d
) dd
CROSS APPLY
(
SELECT  TOP 1 WITH TIES
stuffing
FROM    [20091201_ties].t_distinct di
WHERE   di.glow = dd.glow
ORDER BY
lorderer
) ds

View query details

Now, it's completely other story. With this ordering condition, we have 10 ties per group (instead of 10,000). The index is a real benefit now, and the query completes in only 250 ms.

If we replace DISTINCT with a recursive CTE (the approach described in the previous article), the query can be improved yet a little more:

WITH    rows AS
(
SELECT  MIN(glow) AS glow
FROM    [20091201_ties].t_distinct
UNION ALL
SELECT  glow
FROM    (
SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
FROM    rows r
JOIN    [20091130_groupwise].t_distinct di
ON      di.glow > r.glow
WHERE   r.glow IS NOT NULL
) q
WHERE   rn = 1
)
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    rows r
CROSS APPLY
(
SELECT  TOP 1 WITH TIES
stuffing
FROM    [20091201_ties].t_distinct di
WHERE   di.glow = r.glow
ORDER BY
lorderer
) ds

View query details

5 ms.

Now that's what I call performance.

Summary


To retrieve all records holding group-wise minimums or maximums (including ties), two approaches can be used in SQL Server:

  1. Use DENSE_RANK partitioned by the grouper and return the rows holding the dense rank of 1
  2. Use DISTINCT or a recursive CTE to return a list of groupers and CROSS APPLY a query with TOP 1 WITH TIES to each grouper

The first method requires sorting on (grouper, orderer), while the second one requires index scan (and implied index lookups) for each grouper.

Sorting the large amounts of data is more efficient than using an index lookup to retrieve them. However, index scan processes only the portion of data (within the tie) while sorting requires them all.

For low cardinality orderers (the values inside the MIN and MAX), too much records comprise a tie, and index scan is less efficient than a table scan with sorting. For these orderers, DENSE_RANK solution is more efficient.

For high cardinality orderers, the index scan returns only a small portion of records and it's almost instant. For these orderers, a CROSS APPLY / TOP 1 WITH TIES is more efficient.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

December 1st, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply