<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>EXPLAIN EXTENDED &#187; SQL Server</title>
	<atom:link href="http://explainextended.com/category/sqlserver/feed/" rel="self" type="application/rss+xml" />
	<link>http://explainextended.com</link>
	<description>How to create fast database queries</description>
	<lastBuildDate>Mon, 02 Jan 2012 00:31:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>What&#8217;s UNPIVOT good for?</title>
		<link>http://explainextended.com/2011/06/30/whats-unpivot-good-for/</link>
		<comments>http://explainextended.com/2011/06/30/whats-unpivot-good-for/#comments</comments>
		<pubDate>Thu, 30 Jun 2011 19:00:56 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5373</guid>
		<description><![CDATA[A practical use for UNPIVOT]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Karen</strong> asks:</p>
<blockquote>
<p>… I&#8217;ve always thought <code>PIVOT</code> and <code>UNPIVOT</code> are signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?</p>
</blockquote>
<p>I&#8217;ve found a real use for them in a project I&#8217;ve been working on for the last several months (which is partly why there were no updates for so long!).</p>
<p>Part of the project is a task management system where each task has several persons related to it. There can be the creator of the task, the person the task is assigned to, the actual author of the task (on whose behalf the task is created), and the task can possibly be completed by a person or deleted by a person. That makes a total of 5 person-related fields.</p>
<p>Now, we need to take all tasks within a certain time range and list all people involved in them.</p>
<p>Let&#8217;s create a sample table and see how we would do that.</p>
<p><span id="more-5373"></span></p>
<p><a href="#" onclick="xcollapse('X1598');return false;">Table creation details</a><br />
</p>
<div id="X1598" style="display: none; background: transparent;">
<pre class="brush: sql">
SET NOCOUNT ON
GO

DROP TABLE
        [20110630_unpivot].task

DROP SCHEMA
        [20110630_unpivot]
GO

CREATE SCHEMA
        [20110630_unpivot]

CREATE TABLE
        [20110630_unpivot].Task
        (
        id INT NOT NULL PRIMARY KEY IDENTITY,
        ts DATETIME NOT NULL,
        createdBy INT NOT NULL,
        assignedTo INT NOT NULL,
        onBehalfOf INT NOT NULL,
        completedBy INT,
        deletedBy INT,
        stuffing CHAR(1000) NOT NULL DEFAULT &#039;&#039;
        )
GO

CREATE INDEX
        IX_Task_Ts
ON      [20110630_unpivot].Task(ts)
GO

BEGIN TRANSACTION

SELECT  RAND(20110630)

DECLARE @cnt INT

SET @cnt = 0

WHILE @cnt &lt; 1000000
BEGIN
        INSERT
        INTO    [20110630_unpivot].Task
                (
                ts, createdBy, assignedTo, onBehalfOf, completedBy, deletedBy
                )
        VALUES  (
                DATEADD(s, -@cnt, DATEADD(d, 1, &#039;2011-06-30&#039;)),
                RAND() * 100000,
                RAND() * 100000,
                RAND() * 100000,
                RAND() * 100000,
                RAND() * 100000
                )
        SET @cnt = @cnt + 1
END

COMMIT

GO
</pre>
</div>
<p>There are 5 person fields, a timestamp and a stuffing column. The timestamp field is indexed.</p>
<p>Now, let&#8217;s find all people involved in the tasks between <strong>2011-06-30</strong> and <strong>2011-06-30 04:00:00</strong>. To do this, we could just use 5 queries (each selecting one of the persons) and <code>UNION</code> them:</p>
<pre class="brush: sql">
SELECT  SUM(CAST(id AS BIGINT)), COUNT(*)
FROM    (
        SELECT  createdBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
        UNION
        SELECT  assignedTo AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
        UNION
        SELECT  onBehalfOf AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
        UNION
        SELECT  completedBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
                AND completedBy IS NOT NULL
        UNION
        SELECT  deletedBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
                AND deletedBy IS NOT NULL
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th></th>
<th></th>
</tr>
<tr>
<td class="bigint">2573160101</td>
<td class="int">51331</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0009s (0.4251s)</td>
</tr>
</table>
</div>
<pre>
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 15, logical reads 220765, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 484 ms,  elapsed time = 410 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [globalagg1019]=(0) THEN NULL ELSE [globalagg1021] END, [Expr1017]=CONVERT_IMPLICIT(int,[globalagg1023],0)))
       |--Stream Aggregate(DEFINE:([globalagg1019]=SUM([partialagg1018]), [globalagg1021]=SUM([partialagg1020]), [globalagg1023]=SUM([partialagg1022])))
            |--Parallelism(Gather Streams)
                 |--Compute Scalar(DEFINE:([partialagg1022]=[partialagg1018]))
                      |--Stream Aggregate(DEFINE:([partialagg1018]=Count(*), [partialagg1020]=SUM(CONVERT(bigint,[Union1015],0))))
                           |--Hash Match(Aggregate, HASH:([Union1015]), RESIDUAL:([Union1015] = [Union1015]))
                                |--Concatenation
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[createdBy]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1025]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[assignedTo]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1026]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[onBehalfOf]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1027]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[completedBy]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1028]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]),  WHERE:([ee].[20110630_unpivot].[Task].[completedBy] IS NOT NULL) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[deletedBy]))
                                          |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1029]) OPTIMIZED WITH UNORDERED PREFETCH)
                                               |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                               |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]),  WHERE:([ee].[20110630_unpivot].[Task].[deletedBy] IS NOT NULL) LOOKUP ORDERED FORWARD)
</pre>
<p>Now we have <strong>51331</strong> persons returned in <strong>410 ms</strong>, which required <strong>220765</strong> logical reads.</p>
<p>As we can see in the plan, the table is accessed 5 times (once for each query). This is not very efficient, of course. It would be much better if each matching record in the table were visited only once.</p>
<p>This is where <code>UNPIVOT</code> comes into play.</p>
<p>As its name suggests, it does the reverse of <code>PIVOT</code>, that is, it moves data from multiple columns to multiple rows. And this is exactly what we need in our case.</p>
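<p>A minimal sketch of that transformation on a single hand-written row (the literal values and the <code>taskId</code> column are made up for illustration and are not part of the sample table):</p>
<pre class="brush: sql">
-- One source row with four person columns, one of them NULL
SELECT  taskId, personType, personId
FROM    (
        SELECT  1 AS taskId,
                10 AS createdBy,
                20 AS assignedTo,
                30 AS onBehalfOf,
                CAST(NULL AS INT) AS completedBy
        ) t
UNPIVOT
        (
        -- personId receives the value, personType the name of the source column
        personId FOR personType IN
        (createdBy, assignedTo, onBehalfOf, completedBy)
        ) p
</pre>
<p>This should return three rows, one for each non-<code>NULL</code> column: <code>UNPIVOT</code> silently drops the <code>NULL</code> values, so no row is produced for <code>completedBy</code>.</p>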
<p>Let&#8217;s try it:</p>
<pre class="brush: sql">
SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        UNPIVOT
                (
                personId FOR personType IN
                (createdBy, assignedTo, onBehalfOf, completedBy, deletedBy)
                ) p
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th></th>
<th></th>
</tr>
<tr>
<td class="bigint">2573160101</td>
<td class="int">51331</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (0.1802s)</td>
</tr>
</table>
</div>
<pre>
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 3, logical reads 44153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 165 ms,  elapsed time = 180 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1013]=CASE WHEN [globalagg1016]=(0) THEN NULL ELSE [globalagg1018] END, [Expr1014]=CONVERT_IMPLICIT(int,[globalagg1020],0)))
       |--Stream Aggregate(DEFINE:([globalagg1016]=SUM([partialagg1015]), [globalagg1018]=SUM([partialagg1017]), [globalagg1020]=SUM([partialagg1019])))
            |--Parallelism(Gather Streams)
                 |--Compute Scalar(DEFINE:([partialagg1019]=[partialagg1015]))
                      |--Stream Aggregate(DEFINE:([partialagg1015]=Count(*), [partialagg1017]=SUM(CONVERT(bigint,[Expr1011],0))))
                           |--Sort(DISTINCT ORDER BY:([Expr1011] ASC))
                                |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
                                     |--Hash Match(Partial Aggregate, HASH:([Expr1011]), RESIDUAL:([Expr1011] = [Expr1011]))
                                          |--Filter(WHERE:([Expr1011] IS NOT NULL))
                                               |--Nested Loops(Left Outer Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[createdBy], [ee].[20110630_unpivot].[Task].[assignedTo], [ee].[20110630_unpivot].[Task].[onBehalfOf], [ee].[20110630_unpivot].[Task].[completedBy], [ee].[20110630_unpivot].[Task].[deletedBy]))
                                                    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1022]) OPTIMIZED WITH UNORDERED PREFETCH)
                                                    |    |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                                    |    |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F14C7B541]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                                    |--Constant Scan(VALUES:(([ee].[20110630_unpivot].[Task].[createdBy]),([ee].[20110630_unpivot].[Task].[assignedTo]),([ee].[20110630_unpivot].[Task].[onBehalfOf]),([ee].[20110630_unpivot].[Task].[completedBy]),([ee].[20110630_unpivot].[Task].[deletedBy])))
</pre>
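<p>As we can see, this is more than twice as fast as the <code>UNION</code> query (<strong>180 ms</strong> against <strong>410 ms</strong>) and takes only <strong>44153</strong> logical reads to complete, a fifth of what the <code>UNION</code> needed.</p>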
<p>How does it work internally?</p>
<p>We see in the plan that there is a nested loops join between the result of the clustered index lookup and something called <code>Constant Scan</code>. On each iteration the constant scan returns 5 values, which are the fields listed in the <code>UNPIVOT</code> clause. It just takes each record and outputs the fields found there, without rereading the record. This is exactly what we wanted.</p>
<p>This behavior becomes clearer if we rewrite the query a little:</p>
<pre class="brush: sql">
SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        CROSS APPLY
                (
                SELECT  createdBy AS personId
                UNION ALL
                SELECT  assignedTo AS personId
                UNION ALL
                SELECT  onBehalfOf AS personId
                UNION ALL
                SELECT  completedBy AS personId
                UNION ALL
                SELECT  deletedBy AS personId
                ) p
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
                AND personId IS NOT NULL
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th></th>
<th></th>
</tr>
<tr>
<td class="bigint">2573160101</td>
<td class="int">51331</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (0.1766s)</td>
</tr>
</table>
</div>
<pre>
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 3, logical reads 44153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 149 ms,  elapsed time = 176 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [globalagg1007]=(0) THEN NULL ELSE [globalagg1009] END, [Expr1005]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
            |--Parallelism(Gather Streams)
                 |--Compute Scalar(DEFINE:([partialagg1010]=[partialagg1006]))
                      |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=SUM(CONVERT(bigint,[Union1003],0))))
                           |--Sort(DISTINCT ORDER BY:([Union1003] ASC))
                                |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Union1003]))
                                     |--Hash Match(Partial Aggregate, HASH:([Union1003]), RESIDUAL:([Union1003] = [Union1003]))
                                          |--Filter(WHERE:([Union1003] IS NOT NULL))
                                               |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[createdBy], [ee].[20110630_unpivot].[Task].[assignedTo], [ee].[20110630_unpivot].[Task].[onBehalfOf], [ee].[20110630_unpivot].[Task].[completedBy], [ee].[20110630_unpivot].[Task].[deletedBy]))
                                                    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                                                    |    |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] &gt;= &#39;2011-06-30 00:00:00.000&#39; AND [ee].[20110630_unpivot].[Task].[ts] &lt; &#39;2011-06-30 04:00:00.000&#39;) ORDERED FORWARD)
                                                    |    |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F14C7B541]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                                    |--Constant Scan(VALUES:(([ee].[20110630_unpivot].[Task].[createdBy]),([ee].[20110630_unpivot].[Task].[assignedTo]),([ee].[20110630_unpivot].[Task].[onBehalfOf]),([ee].[20110630_unpivot].[Task].[completedBy]),([ee].[20110630_unpivot].[Task].[deletedBy])))
</pre>
<p>Here, we just take each record and explode it into 5 records using <code>CROSS APPLY</code>.</p>
<p>This yields practically the same plan (the only difference being an inner join instead of a left outer join over the constant scan), the same <strong>I/O</strong> and the same output. In fact, that&#8217;s essentially what the <code>UNPIVOT</code> query does.</p>
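<p>The same idea can be written more compactly with a table value constructor, which is available in <strong>SQL Server 2008</strong> and later. This is only a sketch: it was not measured against the sample table here, so the plan and the timings are an assumption, though one would expect the same <code>Constant Scan</code> to show up:</p>
<pre class="brush: sql">
SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        CROSS APPLY
                (
                -- one row per person column of the current Task record
                VALUES
                (createdBy),
                (assignedTo),
                (onBehalfOf),
                (completedBy),
                (deletedBy)
                ) p (personId)
        WHERE   ts &gt;= &#039;2011-06-30&#039;
                AND ts &lt; &#039;2011-06-30 04:00:00&#039;
                -- unlike UNPIVOT, this form keeps NULLs unless we filter them out
                AND personId IS NOT NULL
        ) q
</pre>
<p>If the role of each person is needed as well, a second column with a string literal can be added to each row of the <code>VALUES</code> list, playing the part of <code>personType</code> in the <code>UNPIVOT</code> query.</p>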
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2011/06/30/whats-unpivot-good-for/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>GROUP_CONCAT in SQL Server</title>
		<link>http://explainextended.com/2010/06/21/group_concat-in-sql-server/</link>
		<comments>http://explainextended.com/2010/06/21/group_concat-in-sql-server/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 19:00:49 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4818</guid>
		<description><![CDATA[I&#8217;m finally back from my vacation. Tunisia&#8217;s great: dates, Carthage, sea and stuff. Now, to the questions. Mahen asks: Create a table called Group: Group id prodname 1 X 1 Y 1 Z 2 A 2 B 2 C The resultset should look like this: id prodname 1 X,Y,Z 2 A,B,C Can you please help [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m finally back from my vacation. Tunisia&#8217;s great: dates, Carthage, sea and stuff.</p>
<p>Now, to the questions.</p>
<p><strong>Mahen</strong> asks:</p>
<blockquote><p>
Create a table called <code>Group</code>:</p>
<table class="excel">
<caption>Group</caption>
<tr>
<th>id</th>
<th>prodname</th>
</tr>
<tr>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>1</td>
<td>Y</td>
</tr>
<tr>
<td>1</td>
<td>Z</td>
</tr>
<tr>
<td>2</td>
<td>A</td>
</tr>
<tr>
<td>2</td>
<td>B</td>
</tr>
<tr>
<td>2</td>
<td>C</td>
</tr>
</table>
<p>The resultset should look like this:</p>
<table class="excel">
<tr>
<th>id</th>
<th>prodname</th>
</tr>
<tr>
<td>1</td>
<td>X,Y,Z</td>
</tr>
<tr>
<td>2</td>
<td>A,B,C</td>
</tr>
</table>
<p>Can you please help me to solve the above problem using a recursive <strong>CTE</strong>?
</p></blockquote>
<p>This is our good old friend, <code>GROUP_CONCAT</code>. It&#8217;s an aggregate function that returns all strings within a group, concatenated. It&#8217;s somewhat different from the other aggregate functions because, first, dealing with the concatenated string can be quite a tedious task for groups with lots of records (large strings tend to overflow), and, second, the result depends on the order of the arguments (which is normally not the case for aggregate functions). It&#8217;s not part of standard <strong>SQL</strong> and, as of now, is implemented only by <strong>MySQL</strong>, with some extra vendor-specific keywords (like <code>ORDER BY</code> within the argument list).</p>
<p>This functionality, however, is often asked for and I have written some articles about implementing this in <a href="/2009/05/02/group_concat-in-postgresql-without-aggregate-functions/"><strong>PostgreSQL</strong></a> and <a href="/2009/05/02/group_concat-in-postgresql-without-aggregate-functions/"><strong>Oracle</strong></a>.</p>
<p>Now, let&#8217;s see how to do it in <strong>SQL Server</strong>.</p>
<p>Usually, <strong>SQL Server</strong>&#8217;s <code>FOR XML</code> clause is exploited to concatenate the strings. To do this, we obtain a list of group identifiers and, for each group, retrieve all its product names with a subquery appended with <code>FOR XML PATH('')</code>. This makes a single <code>XML</code> column out of the recordset:<br />
<span id="more-4818"></span></p>
<pre class="brush: sql">
WITH    q (id, prodname) AS
        (
        SELECT  1, &#039;X&#039;
        UNION ALL
        SELECT  1, &#039;Y&#039;
        UNION ALL
        SELECT  1, &#039;Z&#039;
        UNION ALL
        SELECT  2, &#039;A&#039;
        UNION ALL
        SELECT  2, &#039;B&#039;
        UNION ALL
        SELECT  2, &#039;C&#039;
        )
SELECT  *
FROM    (
        SELECT  DISTINCT id
        FROM    q
        ) qo
CROSS APPLY
        (
        SELECT  CASE ROW_NUMBER() OVER(ORDER BY prodname) WHEN 1 THEN &#039;&#039; ELSE &#039;, &#039; END + qi.prodname
        FROM    q qi
        WHERE   qi.id = qo.id
        ORDER BY
                prodname
        FOR XML PATH (&#039;&#039;)
        ) qi(r)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>r</th>
</tr>
<tr>
<td class="int">1</td>
<td class="ntext">X, Y, Z</td>
</tr>
<tr>
<td class="int">2</td>
<td class="ntext">A, B, C</td>
</tr>
</table>
</div>
<p>This solution works, but converting to and from <code>XML</code> is not the best way to deal with the strings: things like ampersands, angle brackets, line feeds etc. get mangled and require some additional effort to cope with.</p>
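<p>A sketch of that additional effort, using product names invented for illustration that contain an ampersand and an angle bracket. Adding the <code>TYPE</code> directive and reading the result back with the <code>value()</code> method of the <code>xml</code> type turns the entities back into plain characters:</p>
<pre class="brush: sql">
WITH    q (id, prodname) AS
        (
        -- hypothetical names with characters that XML escapes
        SELECT  1, &#039;R&amp;D&#039;
        UNION ALL
        SELECT  1, &#039;A&lt;B&#039;
        )
SELECT  qo.id,
        -- STUFF removes the leading separator
        STUFF(
                (
                SELECT  &#039;, &#039; + qi.prodname
                FROM    q qi
                WHERE   qi.id = qo.id
                ORDER BY
                        qi.prodname
                FOR XML PATH (&#039;&#039;), TYPE
                ).value(&#039;.&#039;, &#039;NVARCHAR(MAX)&#039;),
                1, 2, &#039;&#039;
        ) AS r
FROM    (
        SELECT  DISTINCT id
        FROM    q
        ) qo
</pre>
<p>Without <code>TYPE</code> and <code>value()</code>, the ampersand and the angle bracket would come back as <strong>XML</strong> entities; with them, the names are returned as typed in. The strings still make a round trip through <code>XML</code>, though.</p>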
<p>However, this functionality can really be implemented using a recursive <strong>CTE</strong>.</p>
<p>To do this, we need to do the following:</p>
<ol>
<li>Assign a group-wise row number and a group-wise count to each record (in required order)</li>
<li>Select the first record (that with the row number <strong>1</strong>) from each group in the anchor part of the <code>CTE</code></li>
<li>Recursively append the next record to the previous record. The next record can be obtained by joining on <code>rn = rn + 1</code></li>
<li>Finally, select only the last record from each group (the one whose row number is equal to the group-wise count). It will contain the accumulated string.</li>
</ol>
<p>Here&#8217;s how we do it:</p>
<pre class="brush: sql">
WITH    q (id, prodname) AS
        (
        SELECT  1, &#039;X&#039;
        UNION ALL
        SELECT  1, &#039;Y&#039;
        UNION ALL
        SELECT  1, &#039;Z&#039;
        UNION ALL
        SELECT  2, &#039;A&#039;
        UNION ALL
        SELECT  2, &#039;B&#039;
        UNION ALL
        SELECT  2, &#039;C&#039;
        ),
        qs(id, prodname, rn, cnt) AS
        (
        SELECT  id, prodname,
                ROW_NUMBER() OVER (PARTITION BY id ORDER BY prodname),
                COUNT(*) OVER (PARTITION BY id)
        FROM    q
        ),
        t (id, prodname, gc, rn, cnt) AS
        (
        SELECT  id, prodname,
                CAST(prodname AS NVARCHAR(MAX)), rn, cnt
        FROM    qs
        WHERE   rn = 1
        UNION ALL
        SELECT  qs.id, qs.prodname,
                CAST(t.gc + &#039;, &#039; + qs.prodname AS NVARCHAR(MAX)),
                qs.rn, qs.cnt
        FROM    t
        JOIN    qs
        ON      qs.id = t.id
                AND qs.rn = t.rn + 1
        )
SELECT  id, gc
FROM    t
WHERE   rn = cnt
OPTION (MAXRECURSION 0)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>gc</th>
</tr>
<tr>
<td class="int">2</td>
<td class="ntext">A, B, C</td>
</tr>
<tr>
<td class="int">1</td>
<td class="ntext">X, Y, Z</td>
</tr>
</table>
</div>
<p>As we can see, this only deals with native <code>NVARCHAR</code> and is free from <code>XML</code> conversions.</p>
<p>Hope that helps.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/06/21/group_concat-in-sql-server/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SQL Server: deleting with self-referential FOREIGN KEY</title>
		<link>http://explainextended.com/2010/03/03/sql-server-deleting-with-self-referential-foreign-key/</link>
		<comments>http://explainextended.com/2010/03/03/sql-server-deleting-with-self-referential-foreign-key/#comments</comments>
		<pubDate>Wed, 03 Mar 2010 20:00:45 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4539</guid>
		<description><![CDATA[From Stack Overflow: I have an SQL Server table defined as below: TestComposite id (PK) siteUrl (PK) name parentId 1 site1 Item1 NULL 2 site1 Item2 NULL 3 site1 Folder1 NULL 4 site1 Folder1.Item1 3 5 site1 Folder1.Item2 3 6 site1 Folder1.Folder1 3 7 site1 Folder1.Folder1.Item1 6 Items and folders are stored inside the same [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2371351/composite-primary-keys-and-foreign-key-constraint-error"><strong>Stack Overflow</strong></a>:</p>
<blockquote>
<p>I have an <strong>SQL Server</strong> table defined as below:</p>
<table class="excel">
<caption>TestComposite</caption>
<tr>
<th>id (<em>PK</em>)</th>
<th>siteUrl (<em>PK</em>)</th>
<th>name</th>
<th>parentId</th>
</tr>
<tr>
<td>1</td>
<td>site1</td>
<td>Item1</td>
<td>NULL</td>
</tr>
<tr>
<td>2</td>
<td>site1</td>
<td>Item2</td>
<td>NULL</td>
</tr>
<tr>
<td>3</td>
<td>site1</td>
<td>Folder1</td>
<td>NULL</td>
</tr>
<tr>
<td>4</td>
<td>site1</td>
<td>Folder1.Item1</td>
<td>3</td>
</tr>
<tr>
<td>5</td>
<td>site1</td>
<td>Folder1.Item2</td>
<td>3</td>
</tr>
<tr>
<td>6</td>
<td>site1</td>
<td>Folder1.Folder1</td>
<td>3</td>
</tr>
<tr>
<td>7</td>
<td>site1</td>
<td>Folder1.Folder1.Item1</td>
<td>6</td>
</tr>
</table>
<p>Items and folders are stored inside the same table</p>
<p>If an item is inside a folder, the <code>parentID</code> column is the <code>id</code> of the folder.</p>
<p>I would like to be able to <code>DELETE CASCADE</code> items/folders when I delete a folder.</p>
<p>I tried to define a constraint similar to:</p>
<pre class="brush: sql">
ALTER TABLE [TestComposite]
ADD CONSTRAINT fk_parentid
FOREIGN KEY (ParentID, SiteUrl)
REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE
</pre>
<p>, but it gives me this error:</p>
<pre>
Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
</pre>
</blockquote>
<p><strong>SQL Server</strong> does support chained <code>CASCADE</code> updates, but does not allow one table to participate more than once in a chain (i.e. does not allow <em>loops</em>).</p>
<p><strong>SQL Server</strong>, unlike most other engines, optimizes cascading <strong>DML</strong> operations to be set-based, which requires building a cycle-free <strong>DML</strong> order (which you can observe in the execution plan). With loops, that would not be possible.</p>
<p>However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.</p>
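<p>As a minimal sketch, this is how the constraint from the question could look without the cascading action. It assumes the asker&#8217;s table and column names as quoted above; <code>NO ACTION</code> is the default, so it is spelled out here only for clarity:</p>
<pre class="brush: sql">
-- Same self-referencing key as in the question, but without CASCADE
ALTER TABLE [TestComposite]
ADD CONSTRAINT fk_parentid
FOREIGN KEY (ParentID, SiteUrl)
REFERENCES [TestComposite] (ID, SiteUrl)
ON DELETE NO ACTION
ON UPDATE NO ACTION
</pre>
<p>With such a constraint in place, the engine only verifies that no row is left pointing to a missing parent; removing the children becomes the job of the <code>DELETE</code> statement itself.</p>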
<p>Let&#8217;s create a sample table:<br />
<span id="more-4539"></span><br />
<a href="#" onclick="xcollapse('X7255');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X7255" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20100303_cascade]
CREATE TABLE TestComposite (
        id INT NOT NULL,
        siteUrl NVARCHAR(255) NOT NULL,
        name NVARCHAR(MAX) NOT NULL,
        parentId INT NULL,
        PRIMARY KEY (id, siteUrl)
);
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 0
WHILE @cnt &lt; 50000
BEGIN
        INSERT
        INTO    [20100303_cascade].TestComposite (id, siteUrl, name, parentID)
        VALUES  (
                @cnt / 10 + 1,
                &#039;site&#039; + CAST((@cnt % 10 + 1) AS NVARCHAR(255)),
                &#039;name &#039; + CAST(@cnt / 10 + 1 AS NVARCHAR(MAX)),
                CASE WHEN @cnt &lt; 50 THEN NULL ELSE @cnt / 50 END
                )
        SET @cnt = @cnt + 1
END
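COMMIT
GO
-- Self-referencing constraint and index named in the error message and plan below;
-- their definitions are inferred from those outputs
ALTER TABLE [20100303_cascade].TestComposite
ADD CONSTRAINT fk_TestComposite_self
FOREIGN KEY (parentId, siteUrl)
REFERENCES [20100303_cascade].TestComposite (id, siteUrl)
GO
CREATE INDEX ix_TestComposite_parent_site
ON [20100303_cascade].TestComposite (parentId, siteUrl)
GO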
</pre>
</div>
<p>This table contains <strong>50,000</strong> records forming a hierarchy.</p>
<p>If we try to delete an entry that has some children, we&#8217;ll fail:</p>
<pre class="brush: sql">
DELETE
FROM    [20100303_cascade].TestComposite
WHERE   id = 42
        AND siteUrl = &#039;site1&#039;
</pre>
<pre>
Msg 547, Level 16, State 0, Line 1
The DELETE statement conflicted with the SAME TABLE REFERENCE constraint "fk_TestComposite_self". The conflict occurred in database "test", table "20100303_cascade.TestComposite".
The statement has been terminated.
</pre>
<p>To delete an item together with all of its descendants, we should build a hierarchical query that retrieves the whole branch, and delete the branch with a single statement.</p>
<p>To do this, we just semi-join the table to the results of a recursive <strong>CTE</strong>:</p>
<pre class="brush: sql">
WITH    q AS
        (
        SELECT  id, siteUrl
        FROM    [20100303_cascade].TestComposite
        WHERE   id = 42
                AND siteUrl = &#039;site1&#039;
        UNION ALL
        SELECT  tc.id, tc.siteUrl
        FROM    q
        JOIN    [20100303_cascade].TestComposite tc
        ON      tc.parentID = q.id
                AND tc.siteUrl = q.siteUrl
        )
DELETE
FROM    [20100303_cascade].TestComposite
OUTPUT  DELETED.*
WHERE   EXISTS
        (
        SELECT  id, siteUrl
        INTERSECT
        SELECT  id, siteUrl
        FROM    q
        )
</pre>
<p><a href="#" onclick="xcollapse('X4752');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X4752" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>siteUrl</th>
<th>name</th>
<th>parentId</th>
</tr>
<tr>
<td class="int">42</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 42</td>
<td class="int">8</td>
</tr>
<tr>
<td class="int">211</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 211</td>
<td class="int">42</td>
</tr>
<tr>
<td class="int">212</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 212</td>
<td class="int">42</td>
</tr>
<tr>
<td class="int">213</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 213</td>
<td class="int">42</td>
</tr>
<tr>
<td class="int">214</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 214</td>
<td class="int">42</td>
</tr>
<tr>
<td class="int">215</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 215</td>
<td class="int">42</td>
</tr>
<tr>
<td class="int">1056</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1056</td>
<td class="int">211</td>
</tr>
<tr>
<td class="int">1057</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1057</td>
<td class="int">211</td>
</tr>
<tr>
<td class="int">1058</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1058</td>
<td class="int">211</td>
</tr>
<tr>
<td class="int">1059</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1059</td>
<td class="int">211</td>
</tr>
<tr>
<td class="int">1060</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1060</td>
<td class="int">211</td>
</tr>
<tr>
<td class="int">1061</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1061</td>
<td class="int">212</td>
</tr>
<tr>
<td class="int">1062</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1062</td>
<td class="int">212</td>
</tr>
<tr>
<td class="int">1063</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1063</td>
<td class="int">212</td>
</tr>
<tr>
<td class="int">1064</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1064</td>
<td class="int">212</td>
</tr>
<tr>
<td class="int">1065</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1065</td>
<td class="int">212</td>
</tr>
<tr>
<td class="int">1066</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1066</td>
<td class="int">213</td>
</tr>
<tr>
<td class="int">1067</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1067</td>
<td class="int">213</td>
</tr>
<tr>
<td class="int">1068</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1068</td>
<td class="int">213</td>
</tr>
<tr>
<td class="int">1069</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1069</td>
<td class="int">213</td>
</tr>
<tr>
<td class="int">1070</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1070</td>
<td class="int">213</td>
</tr>
<tr>
<td class="int">1071</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1071</td>
<td class="int">214</td>
</tr>
<tr>
<td class="int">1072</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1072</td>
<td class="int">214</td>
</tr>
<tr>
<td class="int">1073</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1073</td>
<td class="int">214</td>
</tr>
<tr>
<td class="int">1074</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1074</td>
<td class="int">214</td>
</tr>
<tr>
<td class="int">1075</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1075</td>
<td class="int">214</td>
</tr>
<tr>
<td class="int">1076</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1076</td>
<td class="int">215</td>
</tr>
<tr>
<td class="int">1077</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1077</td>
<td class="int">215</td>
</tr>
<tr>
<td class="int">1078</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1078</td>
<td class="int">215</td>
</tr>
<tr>
<td class="int">1079</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1079</td>
<td class="int">215</td>
</tr>
<tr>
<td class="int">1080</td>
<td class="nvarchar">site1</td>
<td class="ntext">name 1080</td>
<td class="int">215</td>
</tr>
<tr class="statusbar">
<td colspan="100">31 rows fetched in 0.0038s (0.0234s)</td>
</tr>
</table>
</div>
<pre>
SQL Server parse and compile time:    CPU time = 0 ms, elapsed time = 1 ms.
Table 'TestComposite'. Scan count 62, logical reads 498, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 5, logical reads 256, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:   CPU time = 16 ms,  elapsed time = 4 ms.
</pre>
<pre>
  |--Sequence
       |--Table Spool
       |    |--Clustered Index Delete(OBJECT:([test].[20100303_cascade].[TestComposite].[PK__TestComposite__78E9C54B]), OBJECT:([test].[20100303_cascade].[TestComposite].[ix_TestComposite_parent_site]))
       |         |--Top(ROWCOUNT est 0)
       |              |--Sort(DISTINCT ORDER BY:([test].[20100303_cascade].[TestComposite].[id] ASC, [test].[20100303_cascade].[TestComposite].[siteUrl] ASC))
       |                   |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1010], [Recr1011]))
       |                        |--Index Spool(WITH STACK)
       |                        |    |--Concatenation
       |                        |         |--Compute Scalar(DEFINE:([Expr1019]=(0)))
       |                        |         |    |--Clustered Index Seek(OBJECT:([test].[20100303_cascade].[TestComposite].[PK__TestComposite__78E9C54B]), SEEK:([test].[20100303_cascade].[TestComposite].[id]=(42) AND [test].[20100303_cascade].[TestComposite].[siteUrl]=N'site1') ORDERED FORWARD)
       |                        |         |--Assert(WHERE:(CASE WHEN [Expr1021]&gt;(100) THEN (0) ELSE NULL END))
       |                        |              |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1021], [Recr1006], [Recr1007]))
       |                        |                   |--Compute Scalar(DEFINE:([Expr1021]=[Expr1020]+(1)))
       |                        |                   |    |--Table Spool(WITH STACK)
       |                        |                   |--Index Seek(OBJECT:([test].[20100303_cascade].[TestComposite].[ix_TestComposite_parent_site] AS [tc]), SEEK:([tc].[parentId]=[Recr1006] AND [tc].[siteUrl]=[Recr1007]) ORDERED FORWARD)
       |                        |--Clustered Index Seek(OBJECT:([test].[20100303_cascade].[TestComposite].[PK__TestComposite__78E9C54B]), SEEK:([test].[20100303_cascade].[TestComposite].[id]=[Recr1010] AND [test].[20100303_cascade].[TestComposite].[siteUrl]=[Recr1011]) ORDERED FORWARD)
       |--Table Spool
       |--Assert(WHERE:(CASE WHEN NOT [Expr1018] IS NULL THEN (0) ELSE NULL END))
            |--Nested Loops(Left Semi Join, OUTER REFERENCES:([test].[20100303_cascade].[TestComposite].[id], [test].[20100303_cascade].[TestComposite].[siteUrl], [Expr1023]) WITH UNORDERED PREFETCH, DEFINE:([Expr1018] = [PROBE VALUE]))
                 |--Table Spool
                 |--Index Seek(OBJECT:([test].[20100303_cascade].[TestComposite].[ix_TestComposite_parent_site]), SEEK:([test].[20100303_cascade].[TestComposite].[parentId]=[test].[20100303_cascade].[TestComposite].[id] AND [test].[20100303_cascade].[TestComposite].[siteUrl]=[test].[20100303_cascade].[TestComposite].[siteUrl]) ORDERED FORWARD)
</pre>
</div>
<p>We see that the whole branch of <strong>31</strong> records was deleted all at once, without violating the <code>FOREIGN KEY</code>.</p>
<p>Unlike some other systems, we don&#8217;t have to worry about the order in which the records are deleted, since <strong>SQL Server</strong> checks all referential constraints at the end of the statement.</p>
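<p>If branches have to be deleted routinely, the statement above can be wrapped into a stored procedure. This is only a sketch: the procedure name <code>prc_delete_branch</code> and the recursion limit of <strong>1000</strong> are assumptions made for illustration (by default, a recursive <strong>CTE</strong> stops with an error after <strong>100</strong> levels), and a plain join replaces the <code>EXISTS</code> with <code>INTERSECT</code>, which is safe here because both key columns are declared <code>NOT NULL</code>:</p>
<pre class="brush: sql">
-- Hypothetical wrapper around the branch delete; names are illustrative
CREATE PROCEDURE [20100303_cascade].prc_delete_branch
        @id INT,
        @siteUrl NVARCHAR(255)
AS
BEGIN
        SET NOCOUNT ON;
        WITH    q AS
                (
                -- Anchor part: the root of the branch
                SELECT  id, siteUrl
                FROM    [20100303_cascade].TestComposite
                WHERE   id = @id
                        AND siteUrl = @siteUrl
                UNION ALL
                -- Recursive part: children of the rows found so far
                SELECT  tc.id, tc.siteUrl
                FROM    q
                JOIN    [20100303_cascade].TestComposite tc
                ON      tc.parentId = q.id
                        AND tc.siteUrl = q.siteUrl
                )
        DELETE  tc
        FROM    [20100303_cascade].TestComposite tc
        JOIN    q
        ON      q.id = tc.id
                AND q.siteUrl = tc.siteUrl
        OPTION  (MAXRECURSION 1000)
END
GO
EXEC    [20100303_cascade].prc_delete_branch @id = 42, @siteUrl = N&#039;site1&#039;
</pre>
<p>The whole branch is still removed by a single <code>DELETE</code>, so the foreign key is verified only after all rows of the branch are gone.</p>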
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/03/03/sql-server-deleting-with-self-referential-foreign-key/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Efficient circle distance testing</title>
		<link>http://explainextended.com/2010/02/26/efficient-circle-distance-testing/</link>
		<comments>http://explainextended.com/2010/02/26/efficient-circle-distance-testing/#comments</comments>
		<pubDate>Fri, 26 Feb 2010 20:00:41 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4449</guid>
		<description><![CDATA[Answering questions asked on the site. eptil asks: I am using SQL Server 2008, but not the spatial features. I have a table with few entries, only 40,000. There is an id INT PRIMARY KEY column and two columns storing a 2d coordinate, both decimals. I would like to find all the records that do [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>eptil</strong> asks:</p>
<blockquote><p>I am using <strong>SQL Server 2008</strong>, but not the spatial features.</p>
<p>I have a table with few entries, only <strong>40,000</strong>. There is an <code>id INT PRIMARY KEY</code> column and two columns storing a 2d coordinate, both decimals.</p>
<p>I would like to find all the records that do not have other records within a given radius. The query I am using at the moment is:</p>
<pre class="brush: sql">
SELECT  id, x, y
FROM    mytable t1
WHERE   (
        SELECT  COUNT(*)
        FROM    mytable t2
        WHERE   ABS(t1.x - t2.x) &lt; 25
                AND ABS(t1.y - t2.y) &lt; 25
        ) = 1
</pre>
<p>This is taking <strong>15</strong> minutes to run at times.</p>
<p>Is there a better way?
</p></blockquote>
<p>Of course, using the spatial features would be a better way, but it is possible to make do with plain <strong>SQL</strong>. This will also work in <strong>SQL Server 2005</strong>.</p>
<p>In most database engines, spatial indexes are implemented as <strong>R-Tree</strong> structures. <strong>SQL Server</strong>, however, uses another approach: surface tessellation.</p>
<p>Basically, it divides the surface into a finite number of tiles, each assigned a unique number. The identifiers of the tiles covered by the object are stored as keys of a plain <strong>B-Tree</strong> index.</p>
<p>When <strong>SQL Server</strong>&#8217;s optimizer sees a geometrical predicate against an indexed column, it calculates the numbers of the tiles that can <em>possibly</em> satisfy this predicate. Say, if the tiles are defined as squares with side <strong>1</strong>, the predicate <code>column.STDistance(@mypoint) &lt; 2</code> can only be satisfied by the objects no more than <strong>2</strong> tiles away from <code>@mypoint</code>&#8217;s tile. This gives a <strong>5 &times; 5</strong> square of <strong>25</strong> tiles with <code>@mypoint</code>&#8217;s tile in the center. The tile numbers can be found and searched for using the index. The exact filtering condition is then applied to each candidate value returned by the index.</p>
<p>The same approach can be used in our case even without the spatial functions. Comparing tile numbers is an equijoin, so the hash join method is eligible for this operation. We can even choose the tiling algorithm individually for each query, since we don&#8217;t have to store the tile identifiers in the table, and the hash table will be built dynamically anyway.</p>
<p>Let&#8217;s create a sample table and see how it works:<br />
<span id="more-4449"></span></p>
<pre class="brush: sql">
CREATE SCHEMA [20100226_circle]
CREATE TABLE t_circle (
        id INT NOT NULL PRIMARY KEY,
        x DECIMAL (15, 3) NOT NULL,
        y DECIMAL (15, 3) NOT NULL,
        )
GO
BEGIN TRANSACTION
SELECT  RAND(20100226)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 50000
BEGIN
        INSERT
        INTO    [20100226_circle].t_circle (id, x, y)
        VALUES  (
                @cnt,
                RAND() * 3200,
                RAND() * 3200
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
</pre>
<p>The table contains <strong>50,000</strong> points at random places within a square of <strong>3,200 &times; 3,200</strong> units.</p>
<p>We can optimize the original query a little by using <code>NOT EXISTS</code> instead of <code>COUNT(*)</code>:</p>
<pre class="brush: sql">
SELECT  *
FROM    [20100226_circle].t_circle co
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    [20100226_circle].t_circle ci
        WHERE   SQRT(POWER(co.x - ci.x, 2) + POWER(co.y - ci.y, 2)) &lt; 25
                AND co.id &lt;&gt; ci.id
        )
ORDER BY
        id
</pre>
<p><a href="#" onclick="xcollapse('X6526');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X6526" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>x</th>
<th>y</th>
</tr>
<tr>
<td class="int">205</td>
<td class="decimal">2247.896</td>
<td class="decimal">3198.399</td>
</tr>
<tr>
<td class="int">2867</td>
<td class="decimal">2159.626</td>
<td class="decimal">1120.590</td>
</tr>
<tr>
<td class="int">13644</td>
<td class="decimal">4.951</td>
<td class="decimal">3165.734</td>
</tr>
<tr>
<td class="int">15917</td>
<td class="decimal">2747.826</td>
<td class="decimal">3041.280</td>
</tr>
<tr>
<td class="int">25183</td>
<td class="decimal">1858.866</td>
<td class="decimal">326.416</td>
</tr>
<tr>
<td class="int">43211</td>
<td class="decimal">1176.369</td>
<td class="decimal">98.079</td>
</tr>
<tr class="statusbar">
<td colspan="100">6 rows fetched in 0.0000s (354.2165s)</td>
</tr>
</table>
</div>
<pre>
Table 'Worktable'. Scan count 2, logical reads 1555993, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_circle'. Scan count 5, logical reads 1405, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 680766 ms,  elapsed time = 354208 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams, ORDER BY:([co].[id] ASC))
       |--Nested Loops(Left Anti Semi Join, WHERE:(sqrt(CONVERT_IMPLICIT(float(53),power([test].[20100226_circle].[t_circle].[x] as [co].[x]-[test].[20100226_circle].[t_circle].[x] as [ci].[x],(2.000000000000000e+000))+power([test].[20100226_circle].[t_circle].[y] as [co].[y]-[test].[20100226_circle].[t_circle].[y] as [ci].[y],(2.000000000000000e+000)),0))&lt;(2.500000000000000e+001) AND [test].[20100226_circle].[t_circle].[id] as [co].[id]&lt;&gt;[test].[20100226_circle].[t_circle].[id] as [ci].[id]))
            |--Clustered Index Scan(OBJECT:([test].[20100226_circle].[t_circle].[PK__t_circle__62065FF3] AS [co]), ORDERED FORWARD)
            |--Table Spool
                 |--Clustered Index Scan(OBJECT:([test].[20100226_circle].[t_circle].[PK__t_circle__62065FF3] AS [ci]))
</pre>
</div>
<p>, but this would still be quite slow.</p>
<p>The problem is that the nested loops have not gone anywhere: they are still in the plan, they just return earlier, as soon as the first match is found. As a result, the query takes about <strong>6</strong> minutes instead of <strong>15</strong>, which is still too much.</p>
<p>To improve the query, we need to make an efficient anti-join method applicable, and the tessellation strategy is the way to go. Here&#8217;s what we need to do to implement this strategy:</p>
<ul>
<li>
<p>Tessellate the surface and assign a unique number to each tile. Since we need to search for the records within <strong>25</strong> units, it is reasonable to divide the surface into squares of <strong>25 &times; 25</strong> units, numbered column-wise. To find out the number of rows and columns, we just need the <code>MIN</code> and <code>MAX</code> of <code>x</code> and <code>y</code>.</p>
<p><img src="http://explainextended.com/wp-content/uploads/2010/02/tesselation.png" alt="" title="Tesselation" width="600" height="600" class="aligncenter size-full wp-image-4451 noborder" /></p>
<p>This is a sample tessellation, assuming that <code>MIN(x)</code> and <code>MAX(x)</code> are <strong>25 &times; 100 = 2500</strong> units apart (and same with <code>y</code>).</p>
</li>
<li>
<p>Find the tile each point belongs to.</p>
</li>
<li>
<p>Build a recordset that matches each point with each of the tiles its neighbors can <em>theoretically</em> reside in. Since a circle with a radius of <strong>25</strong> units can theoretically cover up to <strong>9</strong> adjacent tiles, each point should be matched with each of the <strong>9</strong> tiles forming a <strong>3 &times; 3</strong> square with the point&#8217;s tile in the center.</p>
<p><img src="http://explainextended.com/wp-content/uploads/2010/02/coverage.png" alt="" title="Coverage" width="600" height="600" class="aligncenter size-full wp-image-4457 noborder" /></p>
</li>
<li>
<p>For each point, make sure that no other point exists within the neighboring tiles, additionally applying a fine-filtering condition.</p>
<p>This may sound redundant, since the point-neighbor combination is unique, as well as point-tile, so only one of the <strong>9</strong> candidates will satisfy the join, even if the points reside in the adjacent tiles. But finding the exact distance requires a complex expression, while comparing the tile numbers is an equality condition and as such is eligible for an efficient anti-join method like <code>HASH ANTI JOIN</code>. The coarse filtering on tiles will sieve out most of the far neighbors, so that only the adjacent neighbors will require special attention.</p>
</li>
</ul>
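<p>The second and third steps can be tried in isolation. The sketch below is not part of the final query: it takes a single hypothetical point (<code>60.5, 30.0</code>, chosen only for illustration), finds its tile coordinates and lists the <strong>9</strong> tiles its neighbors could reside in:</p>
<pre class="brush: sql">
WITH    tileset (dim) AS
        (
        -- Offsets to the previous, same and next tile along one axis
        SELECT  -1
        UNION ALL
        SELECT  0
        UNION ALL
        SELECT  1
        ),
        point AS
        (
        -- Hypothetical point, not taken from the sample table
        SELECT  CAST(60.5 AS DECIMAL(15, 3)) AS x,
                CAST(30.0 AS DECIMAL(15, 3)) AS y
        )
SELECT  FLOOR(p.x / 25) + nx.dim AS tile_x,
        FLOOR(p.y / 25) + ny.dim AS tile_y
FROM    point p
CROSS JOIN
        tileset nx
CROSS JOIN
        tileset ny
ORDER BY
        tile_x, tile_y
</pre>
<p>With <strong>25</strong>-unit tiles, the point falls into the tile <code>(2, 1)</code>, so the query returns the nine combinations of <code>tile_x</code> from <strong>1</strong> to <strong>3</strong> and <code>tile_y</code> from <strong>0</strong> to <strong>2</strong>. Folding the two coordinates into a single tile number is then just a matter of multiplying one of them by the grid size.</p>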
<p>And here&#8217;s the query:</p>
<pre class="brush: sql">
WITH    extremes AS
        (
        SELECT  *,
                maxx - minx AS width,
                maxy - miny AS height
        FROM    (
                SELECT  FLOOR(MIN(x) / 25) AS minx,
                        CEILING(MAX(x) / 25) AS maxx,
                        FLOOR(MIN(y) / 25) AS miny,
                        CEILING(MAX(y) / 25) AS maxy
                FROM    [20100226_circle].t_circle
                ) q
        ),
        tileset (dim) AS
        (
        SELECT  -1
        UNION ALL
        SELECT  0
        UNION ALL
        SELECT  1
        ),
        tiles AS
        (
        SELECT  id, x, y, minx, miny, width,
                (FLOOR(x / 25) - minx) * width + FLOOR(y / 25) - miny AS tile
        FROM    extremes
        CROSS JOIN
                [20100226_circle].t_circle
        ),
        neighbors AS
        (
        SELECT  ti.*,
                (FLOOR(ti.x / 25) - ti.minx + nx.dim) * ti.width +
                FLOOR(ti.y / 25) - ti.miny + ny.dim AS mtile
        FROM    tiles ti
        CROSS JOIN
                tileset nx
        CROSS JOIN
                tileset ny
        )
SELECT  *
FROM    tiles tn
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    neighbors n
        WHERE   n.mtile = tn.tile
                AND n.id &lt;&gt; tn.id
                AND SQRT(SQUARE(n.x - tn.x) + SQUARE(n.y - tn.y)) &lt; 25
        )
ORDER BY
        id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>x</th>
<th>y</th>
<th>minx</th>
<th>miny</th>
<th>width</th>
<th>tile</th>
</tr>
<tr>
<td class="int">205</td>
<td class="decimal">2247.896</td>
<td class="decimal">3198.399</td>
<td class="decimal">0</td>
<td class="decimal">0</td>
<td class="decimal">128</td>
<td class="decimal">11519</td>
</tr>
<tr>
<td class="int">2867</td>
<td class="decimal">2159.626</td>
<td class="decimal">1120.590</td>
<td class="decimal">0</td>
<td class="decimal">0</td>
<td class="decimal">128</td>
<td class="decimal">11052</td>
</tr>
<tr>
<td class="int">13644</td>
<td class="decimal">4.951</td>
<td class="decimal">3165.734</td>
<td class="decimal">0</td>
<td class="decimal">0</td>
<td class="decimal">128</td>
<td class="decimal">126</td>
</tr>
<tr>
<td class="int">15917</td>
<td class="decimal">2747.826</td>
<td class="decimal">3041.280</td>
<td class="decimal">0</td>
<td class="decimal">0</td>
<td class="decimal">128</td>
<td class="decimal">14073</td>
</tr>
<tr>
<td class="int">25183</td>
<td class="decimal">1858.866</td>
<td class="decimal">326.416</td>
<td class="decimal">0</td>
<td class="decimal">0</td>
<td class="decimal">128</td>
<td class="decimal">9485</td>
</tr>
<tr>
<td class="int">43211</td>
<td class="decimal">1176.369</td>
<td class="decimal">98.079</td>
<td class="decimal">0</td>
<td class="decimal">0</td>
<td class="decimal">128</td>
<td class="decimal">6019</td>
</tr>
<tr class="statusbar">
<td colspan="100">6 rows fetched in 0.0017s (3.0781s)</td>
</tr>
</table>
</div>
<pre>
Table 't_circle'. Scan count 11, logical reads 3087, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 207406, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 4235 ms,  elapsed time = 3068 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams, ORDER BY:([test].[20100226_circle].[t_circle].[id] ASC))
       |--Sort(ORDER BY:([test].[20100226_circle].[t_circle].[id] ASC))
            |--Compute Scalar(DEFINE:([Expr1007]=floor([Expr1003]/(25.)), [Expr1009]=floor([Expr1005]/(25.)), [Expr1011]=ceiling([Expr1004]/(25.))-floor([Expr1003]/(25.)), [Expr1016]=(([Expr1044]-floor([Expr1003]/(25.)))*(ceiling([Expr1004]/(25.))-floor([Expr1003]/(25.)))+[Expr1045])-floor([Expr1005]/(25.))))
                 |--Hash Match(Left Anti Semi Join, HASH:([Expr1055])=([Expr1054]), RESIDUAL:([Expr1054]=[Expr1055] AND [test].[20100226_circle].[t_circle].[id]&lt;&gt;[test].[20100226_circle].[t_circle].[id] AND sqrt(square(CONVERT_IMPLICIT(float(53),[test].[20100226_circle].[t_circle].[x]-[test].[20100226_circle].[t_circle].[x],0))+square(CONVERT_IMPLICIT(float(53),[test].[20100226_circle].[t_circle].[y]-[test].[20100226_circle].[t_circle].[y],0)))&lt;(2.500000000000000e+001)))
                      |--Bitmap(HASH:([Expr1055]), DEFINE:([Bitmap1056]))
                      |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1055]))
                      |         |--Compute Scalar(DEFINE:([Expr1055]=(([Expr1044]-floor([Expr1003]/(25.)))*(ceiling([Expr1004]/(25.))-floor([Expr1003]/(25.)))+[Expr1045])-floor([Expr1005]/(25.))))
                      |              |--Nested Loops(Inner Join)
                      |                   |--Parallelism(Distribute Streams, Broadcast Partitioning)
                      |                   |    |--Stream Aggregate(DEFINE:([Expr1003]=MIN([partialagg1048]), [Expr1004]=MAX([partialagg1049]), [Expr1005]=MIN([partialagg1050])))
                      |                   |         |--Parallelism(Gather Streams)
                      |                   |              |--Stream Aggregate(DEFINE:([partialagg1048]=MIN([test].[20100226_circle].[t_circle].[x]), [partialagg1049]=MAX([test].[20100226_circle].[t_circle].[x]), [partialagg1050]=MIN([test].[20100226_circle].[t_circle].[y])))
                      |                   |                   |--Clustered Index Scan(OBJECT:([test].[20100226_circle].[t_circle].[PK__t_circle__62065FF3]))
                      |                   |--Compute Scalar(DEFINE:([Expr1044]=floor([test].[20100226_circle].[t_circle].[x]/(25.)), [Expr1045]=floor([test].[20100226_circle].[t_circle].[y]/(25.))))
                      |                        |--Clustered Index Scan(OBJECT:([test].[20100226_circle].[t_circle].[PK__t_circle__62065FF3]))
                      |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1054]), WHERE:(PROBE([Bitmap1056])=TRUE))
                           |--Compute Scalar(DEFINE:([Expr1054]=(((([Expr1046]-floor([Expr1020]/(25.)))+CONVERT_IMPLICIT(decimal(10,0),[Union1037],0))*(ceiling([Expr1021]/(25.))-floor([Expr1020]/(25.)))+[Expr1047])-floor([Expr1022]/(25.)))+CONVERT_IMPLICIT(decimal(10,0),[Union1041],0)))
                                |--Nested Loops(Inner Join)
                                     |--Nested Loops(Inner Join)
                                     |    |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                                     |    |    |--Nested Loops(Inner Join)
                                     |    |         |--Stream Aggregate(DEFINE:([Expr1020]=MIN([partialagg1051]), [Expr1021]=MAX([partialagg1052]), [Expr1022]=MIN([partialagg1053])))
                                     |    |         |    |--Parallelism(Gather Streams)
                                     |    |         |         |--Stream Aggregate(DEFINE:([partialagg1051]=MIN([test].[20100226_circle].[t_circle].[x]), [partialagg1052]=MAX([test].[20100226_circle].[t_circle].[x]), [partialagg1053]=MIN([test].[20100226_circle].[t_circle].[y])))
                                     |    |         |              |--Clustered Index Scan(OBJECT:([test].[20100226_circle].[t_circle].[PK__t_circle__62065FF3]))
                                     |    |         |--Constant Scan(VALUES:(((-1)),((0)),((1))))
                                     |    |--Constant Scan(VALUES:(((-1)),((0)),((1))))
                                     |--Table Spool
                                          |--Compute Scalar(DEFINE:([Expr1046]=floor([test].[20100226_circle].[t_circle].[x]/(25.)), [Expr1047]=floor([test].[20100226_circle].[t_circle].[y]/(25.))))
                                               |--Clustered Index Scan(OBJECT:([test].[20100226_circle].[t_circle].[PK__t_circle__62065FF3]))
</pre>
<p>This query only takes <strong>3 seconds</strong>.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/02/26/efficient-circle-distance-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SQL Server: EXCEPT ALL</title>
		<link>http://explainextended.com/2010/02/10/sql-server-except-all/</link>
		<comments>http://explainextended.com/2010/02/10/sql-server-except-all/#comments</comments>
		<pubDate>Wed, 10 Feb 2010 20:00:52 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4270</guid>
		<description><![CDATA[Answering questions asked on the site. myst asks: I have two really large tables with lots of columns, many of them are nullable I need to remove all rows from table1 which are not present in table2, but there can be duplicates in both tables and writing more than 70 IS NULL conditions would be [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>myst</strong> asks:</p>
<blockquote><p>I have two really large tables with lots of columns, many of them are nullable</p>
<p>I need to remove all rows from <code>table1</code> which are not present in <code>table2</code>, but there can be duplicates in both tables and writing more than <strong>70</strong> <code>IS NULL</code> conditions would be a pain, and I want to make sure there&#8217;s nothing I&#8217;m missing.</p>
<p>Is there a more simple way?
</p></blockquote>
<p><strong>SQL Server</strong> supports the <code>EXCEPT</code> operator, which returns all records present in the first table and absent from the second one. But this operator eliminates duplicates, and its result cannot be the target of a <code>DML</code> operation.</p>
<p><strong>ANSI SQL</strong> standard describes <code>EXCEPT ALL</code> which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, <strong>SQL Server</strong> does not support this operator.</p>
<p>Similar behavior can be achieved using <code>NOT IN</code> or <code>NOT EXISTS</code> constructs. But in <strong>SQL Server</strong>, the <code>IN</code> predicate does not accept more than one field. <code>NOT EXISTS</code> accepts any number of correlated columns, but it requires extra checks in the <code>WHERE</code> clause, since the equality operator does not treat two <strong>NULL</strong> values as equal. Each pair of nullable columns should be additionally checked for a <strong>NULL</strong> in both fields. This can only be done using <code>OR</code> predicates or <code>COALESCE</code>, neither of which helps performance.</p>
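<p>A minimal sketch of the problem, assuming the default <code>SET ANSI_NULLS ON</code> setting (the variable names are made up for illustration): plain equality between two <strong>NULL</strong> values is not true, so the extra <code>IS NULL</code> check is needed to make them match.</p>
<pre class="brush: sql">
-- Illustrative variables; both stay NULL because they are never assigned
DECLARE @a INT, @b INT

SELECT  -- plain equality evaluates to UNKNOWN, so the ELSE branch is taken
        CASE WHEN @a = @b THEN 'equal' ELSE 'not equal' END AS plain_equality,
        -- the additional IS NULL check makes two NULL values match
        CASE WHEN @a = @b OR (@a IS NULL AND @b IS NULL) THEN 'equal' ELSE 'not equal' END AS null_safe_equality
</pre>
<p>With <strong>70</strong> nullable columns, the second form would have to be repeated for every column pair, which is the pain the question describes.</p>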
<p>But there is a way to emulate <code>EXCEPT ALL</code> in <strong>SQL Server</strong> quite elegantly and efficiently.</p>
<p>Let&#8217;s create two sample tables:<br />
<span id="more-4270"></span><br />
<a href="#" onclick="xcollapse('X5455');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X5455" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20100211_except]
CREATE TABLE t1 (
        val1 INT,
        val2 INT,
        val3 INT,
        val4 INT,
        val5 INT
)
CREATE TABLE t2 (
        val1 INT,
        val2 INT,
        val3 INT,
        val4 INT,
        val5 INT
)
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 0
WHILE @cnt &lt; 100000
BEGIN
        INSERT
        INTO    [20100211_except].t1
        VALUES  (
                FLOOR(@cnt / 10),
                FLOOR(@cnt / 10),
                FLOOR(@cnt / 10),
                FLOOR(@cnt / 10),
                CASE WHEN @cnt % 2 = 0 THEN FLOOR(@cnt / 10) ELSE NULL END
                )
        SET @cnt = @cnt + 1
END
INSERT
INTO    [20100211_except].t2
SELECT  TOP 99980 *
FROM    [20100211_except].t1
ORDER BY
        val1, val2, val3, val4, val5
COMMIT
GO
</pre>
</div>
<p>The second table is <strong>20</strong> records short of being the full copy of the first table. There are <strong>NULL</strong> values and duplicates in both tables.</p>
<p>The <code>EXCEPT</code> query returns this:</p>
<pre class="brush: sql">
SELECT  *
FROM    [20100211_except].t1 AS t1
EXCEPT
SELECT  *
FROM    [20100211_except].t2 AS t2
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>val1</th>
<th>val2</th>
<th>val3</th>
<th>val4</th>
<th>val5</th>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr class="statusbar">
<td colspan="100">4 rows fetched in 0.0011s (0.1933s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X3137');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X3137" style="display: none; ">
<pre>
Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 376 ms,  elapsed time = 190 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
            |    |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
            |--Hash Match(Aggregate, HASH:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t1].[val1] as [t1].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t1].[val2] as [t1].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t1].[val3] as [t1].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t1].[val4] as [t1].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t1].[val5] as [t1].[val5]))
                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
                      |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
</pre>
</div>
<p>There are only <strong>4</strong> records in the resultset, since the duplicates are eliminated. And you cannot delete from this resultset.</p>
<p>Here&#8217;s the <code>NOT EXISTS</code> workaround:</p>
<pre class="brush: sql">
SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    [20100211_except].t2 AS t2
        WHERE   1 = 1
                AND (t1.val1 = t2.val1 OR (t1.val1 IS NULL AND t2.val1 IS NULL))
                AND (t1.val2 = t2.val2 OR (t1.val2 IS NULL AND t2.val2 IS NULL))
                AND (t1.val3 = t2.val3 OR (t1.val3 IS NULL AND t2.val3 IS NULL))
                AND (t1.val4 = t2.val4 OR (t1.val4 IS NULL AND t2.val4 IS NULL))
                AND (t1.val5 = t2.val5 OR (t1.val5 IS NULL AND t2.val5 IS NULL))
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>val1</th>
<th>val2</th>
<th>val3</th>
<th>val4</th>
<th>val5</th>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0029s (0.1953s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X7804');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X7804" style="display: none; ">
<pre>
Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 344 ms,  elapsed time = 194 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
            |    |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
                 |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
</pre>
</div>
<p>This returns all <strong>20</strong> records, but just look at that query. And that is with only <strong>5</strong> columns, not <strong>70</strong>.</p>
<p>To emulate <code>EXCEPT ALL</code>, we can use a little trick.</p>
<p>Unlike <strong>PostgreSQL</strong>, <strong>SQL Server</strong> does not support record types. You cannot return a whole record in one field. But it is possible to return a whole record in a tableless <code>SELECT</code> in a correlated subquery. And the plain <code>EXCEPT</code> operator can be applied to this <code>SELECT</code>.</p>
<p>Using this trick, <code>EXCEPT</code> may be applied to each individual record from the first table, not to the table as a whole. If this record is present among the other table&#8217;s records, nothing will be returned by the operator; otherwise the single row which was initially supplied will be returned. This can be filtered using an <code>EXISTS</code> clause.</p>
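<p>To see the per-row behavior in isolation, here is a small sketch that uses constants rather than the sample tables (the derived table <code>s</code> and its values are made up for illustration). It also shows that <code>EXCEPT</code> treats two <strong>NULL</strong> values as matching, which is what removes the need for the <code>IS NULL</code> checks:</p>
<pre class="brush: sql">
-- The row (1, NULL) is present in s, so EXCEPT returns no rows
SELECT  1 AS val1, CAST(NULL AS INT) AS val2
EXCEPT
SELECT  val1, val2
FROM    (
        SELECT  1, CAST(NULL AS INT)
        UNION ALL
        SELECT  2, 2
        ) AS s (val1, val2)

-- The row (3, NULL) is absent from s, so EXCEPT returns that single row
SELECT  3 AS val1, CAST(NULL AS INT) AS val2
EXCEPT
SELECT  val1, val2
FROM    (
        SELECT  1, CAST(NULL AS INT)
        UNION ALL
        SELECT  2, 2
        ) AS s (val1, val2)
</pre>
<p>In the correlated form, the constants on the left side are replaced with the columns of the current outer row.</p>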
<p>Here&#8217;s how it looks:</p>
<pre class="brush: sql">
SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
        (
        SELECT  t1.*
        EXCEPT
        SELECT  *
        FROM    [20100211_except].t2 AS t2
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>val1</th>
<th>val2</th>
<th>val3</th>
<th>val4</th>
<th>val5</th>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0033s (0.7656s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X8269');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X8269" style="display: none; ">
<pre>
Table 'Worktable'. Scan count 100000, logical reads 555796, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't2'. Scan count 1, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 984 ms,  elapsed time = 758 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Nested Loops(Left Semi Join, OUTER REFERENCES:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
            |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
            |--Nested Loops(Left Anti Semi Join)
                 |--Constant Scan
                 |--Index Spool(SEEK:([t2].[val1]=[test].[20100211_except].[t1].[val1] as [t1].[val1] AND [t2].[val2]=[test].[20100211_except].[t1].[val2] as [t1].[val2] AND [t2].[val3]=[test].[20100211_except].[t1].[val3] as [t1].[val3] AND [t2].[val4]=[test].[20100211_except].[t1].[val4] as [t1].[val4] AND [t2].[val5]=[test].[20100211_except].[t1].[val5] as [t1].[val5]))
                      |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
</pre>
</div>
<p>All <strong>20</strong> records are in place.</p>
<p>Note that this query didn&#8217;t use any actual field names at all: only pure set-based operators.</p>
<p>This is a little bit less efficient than a normal field-based <code>EXISTS</code>, because this query is forced to use <code>Nested Loops</code>. However, with an index that is built on the fly by the <code>Index Spool</code> operation, this is still quite fast.</p>
<p>This query behaves just like <code>EXCEPT ALL</code> but formally, it&#8217;s a single-table <code>SELECT</code> query without joins or aggregations. And as such, it can be used in the <code>DML</code>.</p>
<p>Use this query to delete all records <strong>absent</strong> in the second table:</p>
<pre class="brush: sql">
DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
        (
        SELECT  t1.*
        EXCEPT
        SELECT  *
        FROM    [20100211_except].t2
        )
</pre>
<p>And use this one to delete the records <strong>present</strong> in the second table:</p>
<pre class="brush: sql">
DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
        (
        SELECT  t1.*
        EXCEPT
        SELECT  *
        FROM    [20100211_except].t2
        )
</pre>
<p>Beware: <code>EXISTS</code> predicate returns absent records, and <code>NOT EXISTS</code> returns present records.</p>
<p>This is not very intuitive, but may become more clear if you realize that the subquery effectively checks for <strong>absence</strong> of the current row, so the <code>EXISTS</code> predicate in fact says <q>where exists the <strong>absence</strong> of the current row</q>.</p>
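<p>Since a <code>DELETE</code> is hard to undo and the two predicates are easy to mix up, one cautious way to try either statement is to run it inside a transaction and roll it back. This is only a sketch of that habit, not part of the solution itself:</p>
<pre class="brush: sql">
BEGIN TRANSACTION

DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
        (
        SELECT  t1.*
        EXCEPT
        SELECT  *
        FROM    [20100211_except].t2
        )

-- Number of rows the DELETE above has just removed
SELECT  @@ROWCOUNT AS deleted_rows

-- Undo the deletion; replace with COMMIT once the count looks right
ROLLBACK
</pre>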
<p>Hope that helps.</p>
<h3>Update of Feb 12th, 2010</h3>
<p>In feedback on this article, <strong>Gonçalo Ferreira</strong> suggested using <code>INTERSECT</code> instead of <code>EXCEPT</code> in the inner subquery:</p>
<blockquote><p>Wouldn&#8217;t using <code>INTERSECT</code> in the sub-query avoid the <q><code>EXISTS</code> &#8211; absent / <code>NOT EXISTS</code> &#8211; present</q> confusion?
</p></blockquote>
<p>Let&#8217;s try it:</p>
<pre class="brush: sql">
SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
        (
        SELECT  t1.*
        INTERSECT
        SELECT  *
        FROM    [20100211_except].t2 AS t2
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>val1</th>
<th>val2</th>
<th>val3</th>
<th>val4</th>
<th>val5</th>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int">9998</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int">9999</td>
<td class="int"></td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0027s (0.1754s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X7443');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X7443" style="display: none; ">
<pre>
Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 343 ms,  elapsed time = 174 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
            |    |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
                 |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
</pre>
</div>
<p>Not only is this more readable, it is also much faster, since it uses a single <code>Hash Match (Right Anti Semi Join)</code>, the same operator that the column-by-column <code>NOT EXISTS</code> query used.</p>
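<p>Assuming the rewrite carries over to <code>DML</code> the same way the <code>EXCEPT</code> version does, the statement deleting the records <strong>absent</strong> in the second table could be sketched like this:</p>
<pre class="brush: sql">
-- Delete the rows of t1 that have no match in t2 (NULL values are treated as matching)
DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
        (
        SELECT  t1.*
        INTERSECT
        SELECT  *
        FROM    [20100211_except].t2
        )
</pre>
<p>Here <code>NOT EXISTS</code> reads naturally again: the row is deleted when it does not exist in the second table. Replacing it with <code>EXISTS</code> would target the records that are present instead.</p>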
<p>Nice point, Gonçalo!</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/02/10/sql-server-except-all/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SQL Server: running totals</title>
		<link>http://explainextended.com/2010/01/22/sql-server-running-totals/</link>
		<comments>http://explainextended.com/2010/01/22/sql-server-running-totals/#comments</comments>
		<pubDate>Fri, 22 Jan 2010 20:00:18 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4009</guid>
		<description><![CDATA[From Stack Overflow: We have a table of transactions which is structured like the following : transactions TranxID ItemID TranxDate TranxAmt TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by. Getting the current total is obviously simple, but what [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2118829/performant-way-to-get-the-maximum-value-of-a-running-total-in-tsql"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>We have a table of transactions which is structured like the following :</p>
<table class="excel">
<caption>transactions</caption>
<tr>
<th>TranxID</th>
<th>ItemID</th>
<th>TranxDate</th>
<th>TranxAmt</th>
</tr>
</table>
<p><code>TranxAmt</code> can be positive or negative, so the running total of this field (for any <code>ItemID</code>) will go up and down as time goes by.</p>
<p>Getting the current total is obviously simple, but what I&#8217;m after is a performant way of getting the highest value of the running total and the <code>TranxDate</code> when this occurred.</p>
<p>Note that TranxDate is not unique.
</p></blockquote>
<p><strong>SQL Server</strong> is a very nice system, but using it for calculating running totals is a pain.</p>
<p><strong>Oracle</strong> supports additional clauses for analytic functions, <code>RANGE</code> and <code>ROWS</code>, which define the boundaries of the function&#8217;s windows and hence can be used to implement running totals. By default, it is just enough to omit the <code>RANGE</code> clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.</p>
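<p>As an illustration, in <strong>Oracle</strong> a running total over the table from the question could be sketched as follows (the table name <code>transactions</code> is taken from the question; the rest is an assumption about how one would write it):</p>
<pre class="brush: sql">
-- Oracle syntax: ORDER BY without a windowing clause makes the window
-- extend from the first record of the partition up to the current one
SELECT  ItemID, TranxDate, TranxAmt,
        SUM(TranxAmt) OVER (PARTITION BY ItemID ORDER BY TranxDate) AS RunningSum
FROM    transactions
</pre>
<p>Since the default window is a <code>RANGE</code> one, records sharing the same <code>TranxDate</code> get the same running total.</p>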
<p><strong>SQL Server</strong>&#8217;s support for window functions only extends aggregate capabilities a little, so that the aggregate can be returned along with each record that constitutes the group. For functions like <code>SUM</code> and <code>COUNT</code> it is impossible to control the window boundaries and the order of the records. Such analytic functions cannot be used to calculate running totals.</p>
<p>The common way to write such a running total query is using a subquery or a self join which would calculate the <code>SUM</code> of all previous records. However, the complexity of this query is <code>O(n^2)</code> and it&#8217;s not usable for any real volumes of data.</p>
<p>This is one of the few cases where cursors are faster than the set-based solution described above. But we are all aware of the drawbacks of cursors, so we had better look for something else.</p>
<p>This task, fortunately, is a little simpler than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-4009"></span><br />
<a href="#" onclick="xcollapse('X570');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X570" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20100122_running]
CREATE TABLE t_transaction (
        TranxID INT NOT NULL PRIMARY KEY,
        ItemID INT NOT NULL,
        TranxDate DATETIME NOT NULL,
        TranxAmt MONEY NOT NULL
)
GO
CREATE INDEX IX_transaction_date__amt ON [20100122_running].t_transaction (ItemId, TranxDate) INCLUDE (TranxAmt)
GO
BEGIN TRANSACTION
SELECT  RAND(20100122)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 100000
BEGIN
        INSERT
        INTO    [20100122_running].t_transaction
        VALUES  (
                @cnt,
                (@cnt - 1) % 5 + 1,
                DATEADD(day, -RAND() * 5000, &#039;2010-01-22&#039;),
                (RAND() - 0.5) * 100
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
</pre>
</div>
<p>There are <strong>100,000</strong> records with <strong>5</strong> <code>ItemID</code>s (<strong>20,000</strong> records per <code>ItemID</code>). The dates and the values are random, with <strong>5,000</strong> possible dates.</p>
<p>Now, the idea is to develop a recursive query that would iterate through the dates and, on each date, add the sum of that date&#8217;s transactions.</p>
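<p>For comparison, the subquery approach mentioned above could be written against this sample table roughly as follows. This is a sketch of the common pattern, shown only to make the <code>O(n^2)</code> behavior concrete: for every record, the subquery sums up all records of the same <code>ItemID</code> up to that record&#8217;s date.</p>
<pre class="brush: sql">
SELECT  t.ItemID, t.TranxDate, t.TranxAmt,
        (
        -- Re-reads all earlier records of the same item for each outer record
        SELECT  SUM(ti.TranxAmt)
        FROM    [20100122_running].t_transaction ti
        WHERE   ti.ItemID = t.ItemID
                AND ti.TranxDate &lt;= t.TranxDate
        ) AS RunningSum
FROM    [20100122_running].t_transaction t
ORDER BY
        t.ItemID, t.TranxDate
</pre>
<p>Because <code>TranxDate</code> is not unique, records sharing a date get the same total here, which matches a per-date running total.</p>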
<p>Here&#8217;s how we do it (in steps):</p>
<h3>Grouping the records</h3>
<p>The first thing we need to do is group the records by <code>ItemID</code> and <code>TranxDate</code>:</p>
<pre class="brush: sql">
WITH    q AS
        (
        SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
        FROM    [20100122_running].t_transaction
        GROUP BY
                ItemID, TranxDate
        )
</pre>
<p>This kills two birds with one stone: first, it is the part of the query we will need to obtain the seed values to begin the recursion with; second, we get a set of records that is unique on <code>(ItemID, TranxDate)</code> (this will be required a little bit later).</p>
<h3>Retrieving the seed values</h3>
<p>A recursive <strong>CTE</strong> should have an anchor part which is the base for the recursion. To get the initial values of dates and sums we need to <a href="http://explainextended.com/2009/11/30/sql-server-selecting-records-holding-group-wise-maximum/">select the records holding the minimal values</a> of <code>TranxDate</code> for each <code>ItemID</code>. The link above describes two approaches to that in more detail, and here I&#8217;ll just provide the query:</p>
<pre class="brush: sql">
WITH    q AS
        (
        SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
        FROM    [20100122_running].t_transaction
        GROUP BY
                ItemID, TranxDate
        ),
        m AS
        (
        SELECT  qo.ItemID, LastDate, TranxDate, TranxSum
        FROM    (
                SELECT  ItemID, MAX(TranxDate) AS LastDate
                FROM    q
                GROUP BY
                        ItemID
                ) qo
        CROSS APPLY
                (
                SELECT  TOP 1 TranxDate, TranxSum
                FROM    q qi
                WHERE   qi.ItemID = qo.ItemId
                ORDER BY
                        TranxDate
                ) qn
        )
</pre>
<p>Note that I use <code>GROUP BY</code> instead of <code>DISTINCT</code> in the subquery that selects the <code>ItemID</code>s. This lets us select the value of <code>MAX(TranxDate)</code> along with each <code>ItemID</code>, which can then be used as a stop condition in the recursive <strong>CTE</strong>.</p>
<h3>Calculating the running SUM</h3>
<p>To calculate the running <code>SUM</code> we need to run the query recursively, adding one day to <code>TranxDate</code> until it reaches <code>LastDate</code> (which we selected with the previous query).</p>
<p>The value of the running <code>SUM</code> will be that of the previous day plus the total amount of transactions for the current day. Unfortunately, due to <a href="http://explainextended.com/2009/11/18/sql-server-are-the-recursive-ctes-really-set-based/">some limitations of <strong>SQL Server</strong>&#8217;s implementation of recursive <strong>CTE</strong>s</a>, we can use neither <code>LEFT JOIN</code> nor aggregate functions (like <code>SUM</code>) in the recursive part of the <strong>CTE</strong>.</p>
<p>That&#8217;s why we should use precalculated <code>SUM</code>s for the current <code>(ItemID, TranxDate)</code> (which is one of the reasons for pregrouping the records that I mentioned earlier in the article), and retrieve the value using a scalar subquery instead of a <code>LEFT JOIN</code>:</p>
<pre class="brush: sql">
WITH    q AS
        (
        SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
        FROM    [20100122_running].t_transaction
        GROUP BY
                ItemID, TranxDate
        ),
        m AS
        (
        SELECT  qo.ItemID, LastDate, TranxDate, TranxSum
        FROM    (
                SELECT  ItemID, MAX(TranxDate) AS LastDate
                FROM    q
                GROUP BY
                        ItemID
                ) qo
        CROSS APPLY
                (
                SELECT  TOP 1 TranxDate, TranxSum
                FROM    q qi
                WHERE   qi.ItemID = qo.ItemId
                ORDER BY
                        TranxDate
                ) qn
        UNION ALL
        SELECT  ItemId, LastDate, DATEADD(day, 1, m.TranxDate),
                m.TranxSum +
                COALESCE(
                (
                SELECT  TranxSum
                FROM    q
                WHERE   q.ItemID = m.ItemId
                        AND q.TranxDate = DATEADD(day, 1, m.TranxDate)
                ), 0)
        FROM    m
        WHERE   m.TranxDate &lt; m.LastDate
        )
</pre>
<p>Since there is neither a <code>LEFT JOIN</code> nor a <code>SUM</code> in the recursive part of the query, <strong>SQL Server</strong> allows this construct.</p>
<p>If we were to <code>SELECT</code> from this <strong>CTE</strong> we would get a list of running sums for each <code>ItemID</code>.</p>
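<p>To see the recursion at work on something small, here is a minimal sketch that assumes a hypothetical table variable <code>@t</code> holding already pregrouped sums for a single item, with one day missing. The variable, its rows and the hard-coded start and stop dates are illustration-only assumptions, not part of the test schema above:</p>
<pre class="brush: sql">
-- Hypothetical pregrouped data: one item, three transaction days, a gap on Jan 3
DECLARE @t TABLE (
        ItemID INT NOT NULL,
        TranxDate DATETIME NOT NULL,
        TranxSum MONEY NOT NULL,
        PRIMARY KEY (ItemID, TranxDate)
);
INSERT
INTO    @t
SELECT  1, &#039;20100101&#039;, 10
UNION ALL
SELECT  1, &#039;20100102&#039;, -4
UNION ALL
SELECT  1, &#039;20100104&#039;, 7;

WITH    m AS
        (
        -- Anchor part: the first day (hard-coded here for brevity)
        SELECT  ItemID, TranxDate, TranxSum AS RunningSum
        FROM    @t
        WHERE   TranxDate = &#039;20100101&#039;
        UNION ALL
        -- Recursive part: step one day forward, add that day&#039;s sum or 0
        SELECT  m.ItemID, DATEADD(day, 1, m.TranxDate),
                m.RunningSum +
                COALESCE(
                (
                SELECT  TranxSum
                FROM    @t t
                WHERE   t.ItemID = m.ItemID
                        AND t.TranxDate = DATEADD(day, 1, m.TranxDate)
                ), 0)
        FROM    m
        WHERE   m.TranxDate &lt; &#039;20100104&#039;
        )
SELECT  ItemID, TranxDate, RunningSum
FROM    m
ORDER BY
        TranxDate
</pre>
<p>On this data the query should return four rows, one per calendar day, with the running sums <strong>10</strong>, <strong>6</strong>, <strong>6</strong> and <strong>13</strong>: the day without transactions simply repeats the previous total.</p>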
<h3>Selecting greatest running sums</h3>
<p>All we have to do now is select the records holding the greatest running sum per <code>ItemID</code>.</p>
<p><a href="http://explainextended.com/2009/11/30/sql-server-selecting-records-holding-group-wise-maximum/">This article</a> (which I already mentioned above) describes two methods for doing this in <strong>SQL Server</strong>.</p>
<p>We already used one of these methods, the one based on <code>CROSS APPLY</code> (with <code>GROUP BY</code> standing in for <code>DISTINCT</code>). It was the better choice for that task, since we needed a <code>GROUP BY</code> anyway to get the <code>LastDate</code>.</p>
<p>However, to get the records holding the greatest running sums, we will use the other method, which simply filters on a partitioned <code>ROW_NUMBER</code>. It is the better choice here, since the query in question is quite a complex <strong>CTE</strong>. <strong>SQL Server</strong> does not usually materialize <strong>CTE</strong>s, and reevaluating this one for each row in a <code>CROSS APPLY</code> would be quite expensive.</p>
<p>The <code>ROW_NUMBER</code>, by contrast, can be computed and filtered on in a single pass over the <strong>CTE</strong>&#8217;s resultset.</p>
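<p>The filtering step on its own can be sketched on a handful of rows. The table variable <code>@r</code> and its contents below are hypothetical stand-ins for the output of the recursive <strong>CTE</strong>, made up purely for illustration:</p>
<pre class="brush: sql">
-- Hypothetical running sums for two items
DECLARE @r TABLE (
        ItemID INT NOT NULL,
        TranxDate DATETIME NOT NULL,
        RunningSum MONEY NOT NULL
);
INSERT
INTO    @r
SELECT  1, &#039;20100101&#039;, 10
UNION ALL
SELECT  1, &#039;20100102&#039;, 6
UNION ALL
SELECT  1, &#039;20100104&#039;, 13
UNION ALL
SELECT  2, &#039;20100101&#039;, 5
UNION ALL
SELECT  2, &#039;20100103&#039;, 5;

-- Number the rows within each item, greatest sum first, and keep row 1
SELECT  ItemID, TranxDate, RunningSum
FROM    (
        SELECT  r.*,
                ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY RunningSum DESC, TranxDate) AS rn
        FROM    @r r
        ) q
WHERE   rn = 1
</pre>
<p>For item <strong>1</strong> this should keep the row dated <strong>2010-01-04</strong> (sum <strong>13</strong>). For item <strong>2</strong>, where both rows tie on the sum, the second <code>ORDER BY</code> expression picks the earlier date, <strong>2010-01-01</strong>.</p>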
<p>Here&#8217;s how we do it:</p>
<pre class="brush: sql">
WITH    q AS
        (
        SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
        FROM    [20100122_running].t_transaction
        GROUP BY
                ItemID, TranxDate
        ),
        m AS
        (
        SELECT  qo.ItemID, LastDate, TranxDate, TranxSum
        FROM    (
                SELECT  ItemID, MAX(TranxDate) AS LastDate
                FROM    q
                GROUP BY
                        ItemID
                ) qo
        CROSS APPLY
                (
                SELECT  TOP 1 TranxDate, TranxSum
                FROM    q qi
                WHERE   qi.ItemID = qo.ItemId
                ORDER BY
                        TranxDate
                ) qn
        UNION ALL
        SELECT  ItemId, LastDate, DATEADD(day, 1, m.TranxDate),
                m.TranxSum +
                COALESCE(
                (
                SELECT  TranxSum
                FROM    q
                WHERE   q.ItemID = m.ItemId
                        AND q.TranxDate = DATEADD(day, 1, m.TranxDate)
                ), 0)
        FROM    m
        WHERE   m.TranxDate &lt; m.LastDate
        )
SELECT  ItemID, TranxDate, TranxSum
FROM    (
        SELECT  m.*, ROW_NUMBER() OVER (PARTITION BY ItemId ORDER BY TranxSum DESC, TranxDate) AS rn
        FROM    m
        ) qp
WHERE   rn = 1
OPTION (MAXRECURSION 0)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>ItemID</th>
<th>TranxDate</th>
<th>TranxSum</th>
</tr>
<tr>
<td class="int">1</td>
<td class="datetime">1999-10-26 00:00:00.000</td>
<td class="decimal">406.0530</td>
</tr>
<tr>
<td class="int">2</td>
<td class="datetime">2007-07-08 00:00:00.000</td>
<td class="decimal">5594.7071</td>
</tr>
<tr>
<td class="int">3</td>
<td class="datetime">2009-06-04 00:00:00.000</td>
<td class="decimal">4903.4597</td>
</tr>
<tr>
<td class="int">4</td>
<td class="datetime">1996-10-14 00:00:00.000</td>
<td class="decimal">22.3472</td>
</tr>
<tr>
<td class="int">5</td>
<td class="datetime">2008-06-18 00:00:00.000</td>
<td class="decimal">247.2859</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0004s (1.0937s)</td>
</tr>
</table>
</div>
<pre>
Table 'Worktable'. Scan count 2, logical reads 149994, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_transaction'. Scan count 49534, logical reads 151549, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1016 ms,  elapsed time = 1101 ms.
</pre>
<pre>
  |--Filter(WHERE:([Expr1029]=(1)))
       |--Sequence Project(DEFINE:([Expr1029]=row_number))
            |--Compute Scalar(DEFINE:([Expr1041]=(1)))
                 |--Segment
                      |--Sort(ORDER BY:([Recr1025] ASC, [Recr1028] DESC, [Recr1027] ASC))
                           |--Index Spool(WITH STACK)
                                |--Concatenation
                                     |--Compute Scalar(DEFINE:([Expr1036]=(0)))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20100122_running].[t_transaction].[ItemID]))
                                     |         |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[ItemID]) DEFINE:([Expr1004]=MAX([test].[20100122_running].[t_transaction].[TranxDate])))
                                     |         |    |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[ItemID], [test].[20100122_running].[t_transaction].[TranxDate]))
                                     |         |         |--Index Scan(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), ORDERED FORWARD)
                                     |         |--Top(TOP EXPRESSION:((1)))
                                     |              |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[TranxDate]) DEFINE:([Expr1008]=SUM([test].[20100122_running].[t_transaction].[TranxAmt])))
                                     |                   |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[test].[20100122_running].[t_transaction].[ItemID]) ORDERED FORWARD)
                                     |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1038], [Recr1009], [Recr1010], [Recr1011], [Recr1012]))
                                          |--Compute Scalar(DEFINE:([Expr1038]=[Expr1037]+(1)))
                                          |    |--Table Spool(WITH STACK)
                                          |--Compute Scalar(DEFINE:([Expr1013]=dateadd(day,(1),[Recr1011]), [Expr1024]=[Recr1012]+CASE WHEN [Expr1017] IS NOT NULL THEN [Expr1022] ELSE ($0.0000) END))
                                               |--Nested Loops(Left Outer Join, PASSTHRU:(IsFalseOrNull [Expr1017] IS NOT NULL))
                                                    |--Nested Loops(Left Outer Join)
                                                    |    |--Filter(WHERE:(STARTUP EXPR([Recr1011]&lt;[Recr1010])))
                                                    |    |    |--Constant Scan
                                                    |    |--Stream Aggregate(DEFINE:([Expr1017]=SUM([test].[20100122_running].[t_transaction].[TranxAmt])))
                                                    |         |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[Recr1009] AND [test].[20100122_running].[t_transaction].[TranxDate]=dateadd(day,(1),[Recr1011])) ORDERED FORWARD)
                                                    |--Stream Aggregate(DEFINE:([Expr1022]=SUM([test].[20100122_running].[t_transaction].[TranxAmt])))
                                                         |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[Recr1009] AND [test].[20100122_running].[t_transaction].[TranxDate]=dateadd(day,(1),[Recr1011])) ORDERED FORWARD)
</pre>
<p>The query completes in about <strong>1 second</strong>, which is very fast for this amount of data.</p>
<p>It performs much better than a cursor-based solution and of course very much better than <code>SELECT</code>-level subqueries or self-joins that calculate the same sums of the same rows over and over and over again.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/01/22/sql-server-running-totals/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Calculating mode</title>
		<link>http://explainextended.com/2010/01/18/calculating-mode/</link>
		<comments>http://explainextended.com/2010/01/18/calculating-mode/#comments</comments>
		<pubDate>Mon, 18 Jan 2010 20:00:32 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3987</guid>
		<description><![CDATA[From Stack Overflow: I have this query: WITH CTE AS ( SELECT e_id, scale, ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER(PARTITION BY e_id) AS cn FROM ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, AVG(scale) AS [AVG], STDEV(scale) AS [StdDev], AVG(CASE WHEN 2 * rn - cn BETWEEN 0 [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2044314/adding-mode-to-this-sql"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>I have this query:</p>
<pre class="brush: sql">
WITH    CTE AS (
        SELECT  e_id,
                scale,
                ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER(PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian],
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY
        e_id
</pre>
<p>How do I add Mode to this query?
</p></blockquote>
<p>A quick reminder: in statistics, <a href="http://en.wikipedia.org/wiki/Mode_%28statistics%29">mode</a> is the value that occurs most frequently in a data set.</p>
<p>In other words, for each <code>e_id</code>, mode is the (exact) value of <code>scale</code> shared by most records with this <code>e_id</code>.</p>
<p>Unlike the other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, <strong>10</strong> records have <code>scale = 1</code> and <strong>10</strong> other records have <code>scale = 2</code> (and all other values of <code>scale</code> are shared by fewer than <strong>10</strong> records), then there are two modes in this set (and the set, hence, is called <em>bimodal</em>). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.</p>
<p>This means that we should define how to choose the mode when there are several candidates.</p>
<p>There can be three approaches to this:</p>
<ol>
<li>Return every modal value</li>
<li>Return a single modal value</li>
<li>Return an aggregate of all modal values</li>
</ol>
<p>To check all queries, we will generate a simple trimodal dataset:<br />
<span id="more-3987"></span><br />
<a href="#" onclick="xcollapse('X7268');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X7268" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20100118_mode]
CREATE TABLE ScoreMaster (
        e_id INT NOT NULL,
        scale DECIMAL(18, 4) NOT NULL
)
GO

WITH    data (e_id, scale) AS
        (
        SELECT  1, 0.1
        UNION ALL
        SELECT  1, 0.2
        UNION ALL
        SELECT  1, 0.2
        UNION ALL
        SELECT  1, 0.3
        UNION ALL
        SELECT  1, 0.3
        UNION ALL
        SELECT  1, 0.4
        UNION ALL
        SELECT  1, 0.4
        UNION ALL
        SELECT  1, 0.5
        )
INSERT
INTO    [20100118_mode].ScoreMaster
SELECT  *
FROM    data
GO
</pre>
</div>
<p>The basic idea is simple: we should find a value that is held by the maximum number of records. To do this, we need to calculate the number of records sharing a given value of <code>scale</code>. We do this by adding a <code>COUNT</code> as an analytical function into the <strong>CTE</strong>:</p>
<pre class="brush: sql">
WITH    cte AS
        (
        SELECT  e_id,
                scale,
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
</pre>
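<p>To make the new <code>sn</code> column more tangible, the sketch below computes it over a hypothetical table variable <code>@s</code> loaded with the same eight rows as the sample table (the variable name is an assumption made so that the snippet runs on its own):</p>
<pre class="brush: sql">
-- Hypothetical copy of the sample data in a table variable
DECLARE @s TABLE (
        e_id INT NOT NULL,
        scale DECIMAL(18, 4) NOT NULL
);
INSERT
INTO    @s
SELECT  1, 0.1
UNION ALL
SELECT  1, 0.2
UNION ALL
SELECT  1, 0.2
UNION ALL
SELECT  1, 0.3
UNION ALL
SELECT  1, 0.3
UNION ALL
SELECT  1, 0.4
UNION ALL
SELECT  1, 0.4
UNION ALL
SELECT  1, 0.5;

-- sn is the number of records sharing the same (e_id, scale)
SELECT  e_id, scale,
        COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM    @s
ORDER BY
        e_id, scale
</pre>
<p>The rows with <code>scale</code> equal to <strong>0.2</strong>, <strong>0.3</strong> and <strong>0.4</strong> should get <code>sn = 2</code>, while <strong>0.1</strong> and <strong>0.5</strong> get <code>sn = 1</code>, so the three middle values are the modes.</p>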
<h3>Single value</h3>
<p>In this case, we just pick a minimal modal value.</p>
<p>To do this, we group the results returned by <code>e_id</code> (just like the original query does) and select the <code>TOP 1 scale</code> ordered by <code>sn DESC, scale</code> in a subquery:</p>
<pre class="brush: sql">
WITH    cte AS
        (
        SELECT  e_id,
                scale,
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        (
        SELECT  TOP 1 scale
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
        ORDER BY
                sn DESC, scale
        )
FROM    cte co
GROUP BY
        e_id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>e_id</th>
<th></th>
</tr>
<tr>
<td class="int">1</td>
<td class="decimal">.2000</td>
</tr>
</table>
</div>
<p>To select the maximal modal value instead of the minimal one, we would need to add a <code>DESC</code> to the <code>scale</code> in the <code>ORDER BY</code> clause.</p>
<p>By picking the right final part of the <code>ORDER BY</code> clause, any other modal value can be returned.</p>
<p>However, the first part of the <code>ORDER BY</code> (<code>sn DESC</code>) should always remain the same, since it&#8217;s what makes a modal value come first.</p>
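<p>As an illustration of the <code>ORDER BY</code> variation, here is a sketch of the maximal-mode query. It assumes a hypothetical table variable <code>@s</code> with the same rows as the sample table, a trimmed-down <strong>CTE</strong> that keeps only <code>sn</code>, and a column alias <code>max_mode</code> invented for this example:</p>
<pre class="brush: sql">
-- Hypothetical copy of the sample data in a table variable
DECLARE @s TABLE (
        e_id INT NOT NULL,
        scale DECIMAL(18, 4) NOT NULL
);
INSERT
INTO    @s
SELECT  1, 0.1
UNION ALL
SELECT  1, 0.2
UNION ALL
SELECT  1, 0.2
UNION ALL
SELECT  1, 0.3
UNION ALL
SELECT  1, 0.3
UNION ALL
SELECT  1, 0.4
UNION ALL
SELECT  1, 0.4
UNION ALL
SELECT  1, 0.5;

WITH    cte AS
        (
        SELECT  e_id, scale,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    @s
        )
SELECT  e_id,
        (
        -- sn DESC keeps the modal values first, scale DESC picks the largest of them
        SELECT  TOP 1 scale
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
        ORDER BY
                sn DESC, scale DESC
        ) AS max_mode
FROM    cte co
GROUP BY
        e_id
</pre>
<p>For the sample data this should return <strong>0.4000</strong>, the largest of the three modal values.</p>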
<h3>Average of all modal values</h3>
<p>To select the average of all modal values, we need to calculate it in the subquery as well.</p>
<p>We can use a very interesting predicate here:</p>
<pre class="brush: sql">
WITH    cte AS
        (
        SELECT  e_id,
                scale,
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        (
        SELECT  AVG(scale)
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
                AND ci.sn = MAX(co.sn)
        )
FROM    cte co
GROUP BY
        e_id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>e_id</th>
<th></th>
</tr>
<tr>
<td class="int">1</td>
<td class="decimal">.300000</td>
</tr>
</table>
</div>
<p>Note this line in the <code>WHERE</code> clause:</p>
<pre class="brush: sql">
                AND ci.sn = MAX(co.sn)
</pre>
<p>This looks like the naive approach most beginner developers try to use to select the records holding the maximal value of a column.</p>
<p>That of course never works for that purpose (because the <code>WHERE</code> clause is evaluated before the aggregation). In our case, though, we use <code>MAX</code> as a reference value in the <code>SELECT</code>-level correlated subquery, with the aggregation already performed. So the value of the <code>MAX</code> (taken from the results of the outer query) can be used in the subquery&#8217;s <code>WHERE</code> clause all right.</p>
<p><code>AVG</code> in the subquery can be replaced by any other aggregate.</p>
<p>If replaced by <code>MIN</code> or <code>MAX</code>, this aggregate variant becomes equivalent to the first query (which selects a single value). However, the first approach is more efficient for that.</p>
<h3>Returning all modes</h3>
<p>To return all modal values, we should use a <code>CROSS APPLY</code> clause, and return all <code>DISTINCT</code> scales held by the maximal number of records.</p>
<p>Since <code>CROSS APPLY</code> is performed before <code>GROUP BY</code>, we will need to wrap the original query into a subquery (so that <code>GROUP BY</code> is performed first). This way, we can also calculate the <code>MAX(sn)</code> to use it in the <code>CROSS APPLY</code> later:</p>
<pre class="brush: sql">
WITH    cte AS
        (
        SELECT  e_id,
                scale,
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id, scale
FROM    (
        SELECT  e_id, MAX(sn) AS msn
        FROM    cte
        GROUP BY
                e_id
        ) co
CROSS APPLY
        (
        SELECT  DISTINCT scale
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
                AND ci.sn = co.msn
        ) ci
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>e_id</th>
<th>scale</th>
</tr>
<tr>
<td class="int">1</td>
<td class="decimal">.2000</td>
</tr>
<tr>
<td class="int">1</td>
<td class="decimal">.3000</td>
</tr>
<tr>
<td class="int">1</td>
<td class="decimal">.4000</td>
</tr>
</table>
</div>
<p>This query, unlike the previous two, returns several records per <code>e_id</code>, each holding a modal value of <code>scale</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/01/18/calculating-mode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Grouping continuous ranges</title>
		<link>http://explainextended.com/2009/12/30/grouping-continuous-ranges/</link>
		<comments>http://explainextended.com/2009/12/30/grouping-continuous-ranges/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 20:00:32 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3904</guid>
		<description><![CDATA[From Stack Overflow: I have a ticketing system. Now I have to select adjacent places when the user asks for 2 or 3 tickets. Every ticket has a line and column number. The concept of adjacent places is places in the same line with adjacent columns numbers. These tickets are in an SQL Server database. [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/1971628/how-to-select-adjacent-seats-seating-plan"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>
I have a ticketing system.</p>
<p>Now I have to select adjacent places when the user asks for <strong>2</strong> or <strong>3</strong> tickets.</p>
<p>Every ticket has a line and column number. The concept of adjacent places is places in the same line with adjacent columns numbers.</p>
<p>These tickets are in an <strong>SQL Server</strong> database.</p>
<p>Any ideas about this algorithm to search for available adjacent seats?
</p></blockquote>
<p>This is a problem known as <q>grouping continuous ranges</q>: finding the continuous ranges of the records (in a certain order) having the same value of the grouping column.</p>
<p>For the table in question, the groups will look like this:</p>
<table class="excel">
<caption>Seats</caption>
<tr>
<th>Row</th>
<th>Column</th>
<th>Occupied</th>
<th>Group</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>-</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>3</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>5</td>
<td>1</td>
<td>-</td>
</tr>
<tr>
<td>1</td>
<td>6</td>
<td>1</td>
<td>-</td>
</tr>
<tr>
<td>1</td>
<td>7</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>1</td>
<td>8</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>1</td>
<td>9</td>
<td>1</td>
<td>-</td>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>0</td>
<td>3</td>
</tr>
</table>
<p>To find the spans of free seats with the required length, we need to group the continuous ranges of records with <code>occupied = 0</code> in <code>column</code> order (within each row of seats), calculate the counts of records in these groups and return the groups having a sufficient value of <code>COUNT(*)</code>.</p>
<p>To group something in <strong>SQL</strong>, we need to have an expression that would have the same value for all records belonging to the group.</p>
<p>How do we build such an expression for the continuous ranges?</p>
<p>In a movie theater, seats are numbered sequentially, so we can rely on the values of the <code>column</code> being continuous. Now, let&#8217;s take only the records that describe the free seats, and number them with <code>ROW_NUMBER</code> in the <code>column</code> order:</p>
<table class="excel">
<caption>Seats</caption>
<tr>
<th>Row</th>
<th>Column</th>
<th>Occupied</th>
<th>ROW_NUMBER</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>3</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>1</td>
<td>7</td>
<td>0</td>
<td>4</td>
</tr>
<tr>
<td>1</td>
<td>8</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>0</td>
<td>6</td>
</tr>
</table>
<p>We see that filtering only the free seats broke the column numbering order (the column numbers are not continuous anymore), but the <code>ROW_NUMBER</code>s are continuous (by definition). Each occupied seat breaks the continuity of the columns, and the spans of the free seats correspond to the ranges of columns with unbroken continuity.</p>
<p>If the continuity is not broken, <code>column</code> and <code>ROW_NUMBER</code> both increase by <strong>1</strong> with each record. This means that their difference will be constant:</p>
<table class="excel">
<caption>Seats</caption>
<tr>
<th>Row</th>
<th>Column</th>
<th>Occupied</th>
<th>ROW_NUMBER</th>
<th>column &#8211; ROW_NUMBER</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>3</td>
<td>0</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>0</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>7</td>
<td>0</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>1</td>
<td>8</td>
<td>0</td>
<td>5</td>
<td>3</td>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>0</td>
<td>6</td>
<td>4</td>
</tr>
</table>
<p>The value of difference between <code>column</code> and <code>ROW_NUMBER</code>, therefore, uniquely defines the continuous group the records belong to.</p>
<p>This means we can group on it and do other things (the rest being purely technical).</p>
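<p>The trick can be tried out on the single row of ten seats from the tables above. The sketch below assumes a hypothetical table variable <code>@seat</code> filled with exactly those values, and uses made-up aliases (<code>grp</code>, <code>first_col</code>, <code>last_col</code>, <code>span_length</code>) to return one record per span of free seats:</p>
<pre class="brush: sql">
-- Hypothetical copy of the illustration: seats 1, 5, 6 and 9 are occupied
DECLARE @seat TABLE (
        row INT NOT NULL,
        col INT NOT NULL,
        occupied BIT NOT NULL,
        PRIMARY KEY (row, col)
);
INSERT
INTO    @seat
SELECT  1, 1, 1
UNION ALL
SELECT  1, 2, 0
UNION ALL
SELECT  1, 3, 0
UNION ALL
SELECT  1, 4, 0
UNION ALL
SELECT  1, 5, 1
UNION ALL
SELECT  1, 6, 1
UNION ALL
SELECT  1, 7, 0
UNION ALL
SELECT  1, 8, 0
UNION ALL
SELECT  1, 9, 1
UNION ALL
SELECT  1, 10, 0;

WITH    free AS
        (
        -- col - ROW_NUMBER stays constant within each unbroken run of free seats
        SELECT  row, col,
                col - ROW_NUMBER() OVER (PARTITION BY row ORDER BY col) AS grp
        FROM    @seat
        WHERE   occupied = 0
        )
SELECT  row, MIN(col) AS first_col, MAX(col) AS last_col, COUNT(*) AS span_length
FROM    free
GROUP BY
        row, grp
ORDER BY
        row, first_col
</pre>
<p>For this data the query should return three spans: columns <strong>2</strong> to <strong>4</strong> (length <strong>3</strong>), columns <strong>7</strong> to <strong>8</strong> (length <strong>2</strong>) and the single seat <strong>10</strong>, matching the groups marked in the first table.</p>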
<p>Let&#8217;s create a sample table:<br />
<span id="more-3904"></span><br />
<a href="#" onclick="xcollapse('X10496');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X10496" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20091230_seats]
CREATE TABLE t_seat (
      row INT NOT NULL,
      col INT NOT NULL,
      occupied BIT NOT NULL,
      CONSTRAINT pk_seat_row_col
      PRIMARY KEY  (row, col)
)
GO
BEGIN TRANSACTION
SELECT  RAND(20091230)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 100
BEGIN
        INSERT
        INTO    [20091230_seats].t_seat (row, col, occupied)
        VALUES  (
                (@cnt - 1) / 10 + 1,
                (@cnt - 1) % 10 + 1,
                ROUND(RAND(), 0)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
</pre>
</div>
<p>and select the chart of the places:</p>
<pre class="brush: sql">
SELECT  row, [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]
FROM    (
        SELECT  row, col,
                CASE occupied WHEN 0 THEN &#039; &#039; ELSE &#039;X&#039; END AS occupied
        FROM    [20091230_seats].t_seat
        ) q
PIVOT   (
        MAX(occupied)
        FOR col IN
        ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
        ) p
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>row</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
<tr>
<td class="int">1</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">2</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">3</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">4</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">5</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">6</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">7</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">8</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">9</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">10</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0011s (0.0012s)</td>
</tr>
</table>
</div>
<p>Now, let&#8217;s select the seat spans having length <strong>5</strong> or more:</p>
<pre class="brush: sql">
WITH    free AS
        (
        SELECT  row, col, ROW_NUMBER() OVER (PARTITION BY row ORDER BY col) AS rn
        FROM    [20091230_seats].t_seat
        WHERE   occupied = 0
        ),
        spans AS
        (
        SELECT  row, col, COUNT(*) OVER (PARTITION BY row, col - rn) AS rl
        FROM    free
        )
SELECT  *
FROM    spans
WHERE   rl &gt;= 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>row</th>
<th>col</th>
<th>rl</th>
</tr>
<tr>
<td class="int">5</td>
<td class="int">2</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">5</td>
<td class="int">3</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">5</td>
<td class="int">4</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">5</td>
<td class="int">5</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">5</td>
<td class="int">6</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">5</td>
<td class="int">7</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">4</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">5</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">6</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">7</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">8</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">9</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">10</td>
<td class="int">7</td>
</tr>
<tr class="statusbar">
<td colspan="100">13 rows fetched in 0.0005s (0.0016s)</td>
</tr>
</table>
</div>
<p>and put them on the chart:</p>
<pre class="brush: sql">
WITH    free AS
        (
        SELECT  row, col, ROW_NUMBER() OVER (PARTITION BY row ORDER BY col) AS rn
        FROM    [20091230_seats].t_seat
        WHERE   occupied = 0
        ),
        spans AS
        (
        SELECT  row, col, COUNT(*) OVER (PARTITION BY row, col - rn) AS rl
        FROM    free
        )
SELECT  row, [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]
FROM    (
        SELECT  se.row, se.col,
                CASE
                WHEN rl IS NULL THEN
                        &#039;X&#039;
                WHEN rl &lt; 5 THEN
                        &#039; &#039;
                ELSE
                        CAST(rl AS VARCHAR)
                END AS occupied
        FROM    [20091230_seats].t_seat se
        LEFT JOIN
                spans sp
        ON      sp.row = se.row
                AND sp.col = se.col
        ) q
PIVOT   (
        MAX(occupied)
        FOR col IN
        ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
        ) p
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>row</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
<tr>
<td class="int">1</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">2</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">3</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">4</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">5</td>
<td class="varchar">X</td>
<td class="varchar">6</td>
<td class="varchar">6</td>
<td class="varchar">6</td>
<td class="varchar">6</td>
<td class="varchar">6</td>
<td class="varchar">6</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">6</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
</tr>
<tr>
<td class="int">7</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">7</td>
<td class="varchar">7</td>
<td class="varchar">7</td>
<td class="varchar">7</td>
<td class="varchar">7</td>
<td class="varchar">7</td>
<td class="varchar">7</td>
</tr>
<tr>
<td class="int">8</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">9</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr>
<td class="int">10</td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar">X</td>
<td class="varchar"> </td>
<td class="varchar"> </td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0012s (0.0098s)</td>
</tr>
</table>
</div>
<p>Free spans of <strong>5</strong> seats or more (in rows <strong>5</strong> and <strong>7</strong>) are marked on the chart with the numbers corresponding to the number of free seats in the spans.</p>
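<p>The grouping trick used above can be seen on a smaller scale. The following is a minimal sketch on a hypothetical single row of six seats (the inline <code>seats</code> set and the <code>grp</code> alias are made up for illustration and are not part of the sample schema). Within a continuous run of free seats, <code>col</code> and <code>rn</code> grow in lockstep, so <code>col - rn</code> stays constant and can serve as a partition key:</p>
<pre class="brush: sql">
-- Hypothetical data: seat 3 is occupied, all other seats are free
WITH    seats (col, occupied) AS
        (
        SELECT  1, 0
        UNION ALL
        SELECT  2, 0
        UNION ALL
        SELECT  3, 1
        UNION ALL
        SELECT  4, 0
        UNION ALL
        SELECT  5, 0
        UNION ALL
        SELECT  6, 0
        ),
        free AS
        (
        -- Number the free seats only
        SELECT  col, ROW_NUMBER() OVER (ORDER BY col) AS rn
        FROM    seats
        WHERE   occupied = 0
        )
SELECT  col, rn,
        col - rn AS grp,                                -- constant within a continuous span
        COUNT(*) OVER (PARTITION BY col - rn) AS rl     -- length of that span
FROM    free
ORDER BY
        col
</pre>
<p>Seats <strong>1</strong> and <strong>2</strong> share <code>grp = 0</code> and get <code>rl = 2</code>; seats <strong>4</strong>, <strong>5</strong> and <strong>6</strong> share <code>grp = 1</code> and get <code>rl = 3</code>.</p>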
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2009/12/30/grouping-continuous-ranges/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SQL Server: Selecting records holding group-wise maximum (with ties)</title>
		<link>http://explainextended.com/2009/12/01/sql-server-selecting-records-holding-group-wise-maximum-with-ties/</link>
		<comments>http://explainextended.com/2009/12/01/sql-server-selecting-records-holding-group-wise-maximum-with-ties/#comments</comments>
		<pubDate>Tue, 01 Dec 2009 20:00:58 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3822</guid>
		<description><![CDATA[A feedback on the yesterday&#8217;s article on selecting records holding group-wise maximums in SQL Server. Gonçalo asks: Regarding the post I mention on the subject, wouldn&#8217;t it be easier to obtain the result you&#8217;re after using the SQL Server specific SELECT TOP x WITH TIES? This clause (which is indeed specific to SQL Server) returns [...]]]></description>
			<content:encoded><![CDATA[<p>A feedback on the yesterday&#8217;s article on <a href="/2009/11/30/sql-server-selecting-records-holding-group-wise-maximum/">selecting records holding group-wise maximums in <strong>SQL Server</strong></a>.</p>
<p><strong>Gonçalo</strong> asks:</p>
<blockquote><p>Regarding the post I mention on the subject, wouldn&#8217;t it be easier to obtain the result you&#8217;re after using the <strong>SQL Server</strong> specific <code>SELECT TOP x WITH TIES</code>?
</p></blockquote>
<p>This clause (which is indeed specific to <strong>SQL Server</strong>) returns <em>ties</em>. <code>TOP x WITH TIES</code> means: return the first <code>x</code> records, plus all records holding the same values of the <code>ORDER BY</code> expressions as the last (<code>x</code>-th) record returned by the <code>TOP</code>.</p>
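<p>A minimal illustration of these semantics, using a hypothetical inline rowset (the <code>id</code> and <code>score</code> columns are made up for this sketch):</p>
<pre class="brush: sql">
-- Hypothetical data: scores 10, 20, 20, 30
SELECT  TOP 2 WITH TIES
        id, score
FROM    (
        SELECT  1 AS id, 10 AS score
        UNION ALL
        SELECT  2, 20
        UNION ALL
        SELECT  3, 20
        UNION ALL
        SELECT  4, 30
        ) q
ORDER BY
        score
</pre>
<p>A plain <code>TOP 2</code> would stop after two records. With <code>WITH TIES</code>, three records are returned: the one with the score of <strong>10</strong> and both records with the score of <strong>20</strong>, since the second of those ties with the last record the <code>TOP</code> would return.</p>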
<p>With a little effort this can be used for one of the queries I wrote about yesterday, namely the one that returns all records holding <code>MIN(orderer)</code> within each group.</p>
<p>However, the efficiency of this solution varies greatly depending on the cardinality of both ordering and grouping columns.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-3822"></span><br />
<a href="#" onclick="xcollapse('X7848');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X7848" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20091201_ties]
CREATE TABLE t_distinct (
      id INT NOT NULL PRIMARY KEY,
      orderer INT NOT NULL,
      lorderer INT NOT NULL,
      glow INT NOT NULL,
      ghigh INT NOT NULL,
      stuffing VARCHAR(200) NOT NULL
)
GO
CREATE INDEX ix_distinct_glow_id ON [20091201_ties].t_distinct (glow);
CREATE INDEX ix_distinct_ghigh_id ON [20091201_ties].t_distinct (ghigh);
CREATE INDEX ix_distinct_glow_orderer_id ON [20091201_ties].t_distinct (glow, orderer);
CREATE INDEX ix_distinct_ghigh_orderer_id ON [20091201_ties].t_distinct (ghigh, orderer);
CREATE INDEX ix_distinct_glow_lorderer_id ON [20091201_ties].t_distinct (glow, lorderer);
CREATE INDEX ix_distinct_ghigh_lorderer_id ON [20091201_ties].t_distinct (ghigh, lorderer);
BEGIN TRANSACTION
SELECT  RAND(20091201)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 1000000
BEGIN
        INSERT
        INTO    [20091201_ties].t_distinct (id, orderer, lorderer, glow, ghigh, stuffing)
        VALUES  (
                @cnt,
                FLOOR(RAND() * 10) + 1,
                FLOOR(RAND() * 10000) + 1,
                (@cnt - 1) % 10 + 1,
                (@cnt - 1) % 10000 + 1,
                REPLICATE(&#039;*&#039;, 200)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
</pre>
</div>
<p>Just like in the previous articles, this table contains <code>1,000,000</code> records with the following fields:</p>
<ul>
<li><code>id</code> is the <code>PRIMARY KEY</code></li>
<li><code>orderer</code> is filled with random values from <strong>1</strong> to <strong>10</strong></li>
<li><code>glow</code> is a low cardinality grouping field (<strong>10</strong> distinct values)</li>
<li><code>ghigh</code> is a high cardinality grouping field (<strong>10,000</strong> distinct values)</li>
<li><code>stuffing</code> is an asterisk-filled <code>VARCHAR(200)</code> column added to emulate payload of the actual tables</li>
</ul>
<p>Additionally, there is one more field: <code>lorderer</code>. Just like <code>orderer</code>, it is filled with random values, but has much greater cardinality: the values are <strong>1</strong> to <strong>10,000</strong>.</p>
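<p>The cardinalities described above can be checked with a query like the one below. This is only a sketch, assuming the table has been created and filled by the script above; the column aliases are arbitrary:</p>
<pre class="brush: sql">
-- Count the records and the distinct values in each field of interest
SELECT  COUNT(*) AS cnt,
        COUNT(DISTINCT orderer) AS orderers,
        COUNT(DISTINCT lorderer) AS lorderers,
        COUNT(DISTINCT glow) AS glows,
        COUNT(DISTINCT ghigh) AS ghighs
FROM    [20091201_ties].t_distinct
</pre>
<p>The values of <code>glow</code> and <code>ghigh</code> are derived from <code>id</code>, so their counts are deterministic (<strong>10</strong> and <strong>10,000</strong>). The two orderers are random, so their counts can only be expected to be close to <strong>10</strong> and <strong>10,000</strong>.</p>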
<p>Now, let&#8217;s try to select the records holding <code>MIN(orderer)</code> using the two approaches.</p>
<h3>Low cardinality orderers</h3>
<hr/>
<h4>Using DENSE_RANK</h4>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
        FROM    [20091201_ties].t_distinct d
        ) dd
WHERE   dr = 1
</pre>
<p><a href="#" onclick="xcollapse('X1668');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1668" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>psum</th>
</tr>
<tr>
<td class="int">99558</td>
<td class="int">19911600</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0000s (41.0302s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 3, logical reads 34423, physical reads 141, read-ahead reads 31072, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 5765 ms,  elapsed time = 41031 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1007],0), [Expr1004]=CASE WHEN [globalagg1009]=(0) THEN NULL ELSE [globalagg1011] END))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=COUNT_BIG([Expr1005]), [partialagg1010]=SUM([Expr1005])))
                      |--Filter(WHERE:([Expr1002]=(1)))
                           |--Compute Scalar(DEFINE:([Expr1005]=len([test].[20091201_ties].[t_distinct].[stuffing] as [d].[stuffing])))
                                |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                                     |--Sequence Project(DEFINE:([Expr1002]=dense_rank))
                                          |--Segment
                                               |--Segment
                                                    |--Parallelism(Gather Streams, ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
                                                         |--Sort(ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
                                                              |--Clustered Index Scan(OBJECT:([test].[20091201_ties].[t_distinct].[PK__t_distinct__7EB7AD3A] AS [d]))
</pre>
</div>
<p>As with yesterday&#8217;s table, this query runs for <strong>41</strong> seconds. The plan used by the engine involves partitioning, sorting and gathering the streams, but the real problem is that the table is large and a full scan is used. There is just too much data to read.</p>
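<p>The article does not state how the figures in the query details were collected. Output of this shape is what <strong>SQL Server</strong> prints when the <code>STATISTICS IO</code> and <code>STATISTICS TIME</code> session options are on, so a sketch for reproducing them, assuming the sample table above exists, could look like this (the numbers will differ on other hardware and with a warm cache):</p>
<pre class="brush: sql">
-- Report page reads and CPU / elapsed times for each statement in this session
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
        FROM    [20091201_ties].t_distinct d
        ) dd
WHERE   dr = 1
GO
-- Switch the reporting off again
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
</pre>
<p>A text plan like the one shown in the details can be obtained with <code>SET SHOWPLAN_TEXT ON</code>, which has to be issued in a batch of its own and makes the server return the plan instead of running the query.</p>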
<p>Now, let&#8217;s try the same query using <code>CROSS APPLY</code> / <code>TOP … WITH TIES</code>.</p>
<h4>Using CROSS APPLY / TOP … WITH TIES</h4>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091201_ties].t_distinct d
        ) dd
CROSS APPLY
        (
        SELECT  TOP 1 WITH TIES
                stuffing
        FROM    [20091201_ties].t_distinct di
        WHERE   di.glow = dd.glow
        ORDER BY
                orderer
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X984');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X984" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>psum</th>
</tr>
<tr>
<td class="int">99558</td>
<td class="int">19911600</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0000s (116.9189s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 11, logical reads 456326, physical reads 1281, read-ahead reads 67626, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1687 ms,  elapsed time = 116924 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1015],0), [Expr1005]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
       |--Stream Aggregate(DEFINE:([Expr1015]=Count(*), [Expr1016]=COUNT_BIG([Expr1006]), [Expr1017]=SUM([Expr1006])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
                 |--Stream Aggregate(GROUP BY:([d].[glow]))
                 |    |--Index Scan(OBJECT:([test].[20091201_ties].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
                 |--Top(TOP EXPRESSION:((1)))
                      |--Compute Scalar(DEFINE:([Expr1006]=len([test].[20091201_ties].[t_distinct].[stuffing] as [di].[stuffing])))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[id], [Expr1014]) WITH ORDERED PREFETCH)
                                |--Index Seek(OBJECT:([test].[20091201_ties].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[test].[20091201_ties].[t_distinct].[glow] as [d].[glow]) ORDERED FORWARD)
                                |--Clustered Index Seek(OBJECT:([test].[20091201_ties].[t_distinct].[PK__t_distinct__7EB7AD3A] AS [di]), SEEK:([di].[id]=[test].[20091201_ties].[t_distinct].[id] as [di].[id]) LOOKUP ORDERED FORWARD)
</pre>
</div>
<p>This query runs for… um, <strong>117</strong> seconds. Thrice as slow.</p>
<p>Again: there are too many records. The <code>TOP</code> uses the index to return the first rows, which is normally fast, but not when there are <strong>10,000</strong> records to return. Using the index implies the need to look up the values of <code>stuffing</code> (which we use in this query) in the actual table. You can see it in the plan above as a <code>Nested Loops</code> branch which joins the index (<code>Index Seek</code>) and the actual table (<code>Clustered Index Seek</code>). The overhead required for joining the index and the table by far outweighs all benefits of having preordered data.</p>
<p>However, this is only true for the orderers with low cardinality.</p>
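<p>The lookups described above happen because <code>stuffing</code> is not stored in the index. One way to avoid them, which is not tested in this article, would be a covering index. The sketch below uses a hypothetical index name and only shows the syntax; no timings were measured for it:</p>
<pre class="brush: sql">
-- Hypothetical covering index: the leaf level stores stuffing,
-- so the query would not need to visit the clustered index for it
CREATE INDEX ix_distinct_glow_orderer_stuffing
ON      [20091201_ties].t_distinct (glow, orderer)
INCLUDE (stuffing)
</pre>
<p>Since <code>stuffing</code> makes up most of each record, such an index would be about as large as the table itself, which is a significant price to pay for a single query.</p>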
<h3>High cardinality orderers</h3>
<hr/>
<p>Let&#8217;s try the same approaches but this time let&#8217;s use them to return the records holding <code>MIN(lorderer)</code>. This field has much higher cardinality and there are few records per group that hold the <code>MIN(lorderer)</code> value.</p>
<h4>Using DENSE_RANK</h4>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY lorderer) AS dr
        FROM    [20091201_ties].t_distinct d
        ) dd
WHERE   dr = 1
</pre>
<p><a href="#" onclick="xcollapse('X2251');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X2251" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>psum</th>
</tr>
<tr>
<td class="int">103</td>
<td class="int">20600</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0000s (40.7021s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 3, logical reads 34423, physical reads 32, read-ahead reads 26743, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 6249 ms,  elapsed time = 40703 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1007],0), [Expr1004]=CASE WHEN [globalagg1009]=(0) THEN NULL ELSE [globalagg1011] END))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=COUNT_BIG([Expr1005]), [partialagg1010]=SUM([Expr1005])))
                      |--Filter(WHERE:([Expr1002]=(1)))
                           |--Compute Scalar(DEFINE:([Expr1005]=len([test].[20091201_ties].[t_distinct].[stuffing] as [d].[stuffing])))
                                |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                                     |--Sequence Project(DEFINE:([Expr1002]=dense_rank))
                                          |--Segment
                                               |--Segment
                                                    |--Parallelism(Gather Streams, ORDER BY:([d].[glow] ASC, [d].[lorderer] ASC))
                                                         |--Sort(ORDER BY:([d].[glow] ASC, [d].[lorderer] ASC))
                                                              |--Clustered Index Scan(OBJECT:([test].[20091201_ties].[t_distinct].[PK__t_distinct__7EB7AD3A] AS [d]))
</pre>
</div>
<p><strong>40</strong> seconds.</p>
<p>The performance is just as poor. Since we have exactly the same full table scan with exactly the same million rows to sort, it&#8217;s quite predictable.</p>
<h4>Using CROSS APPLY / TOP … WITH TIES</h4>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091201_ties].t_distinct d
        ) dd
CROSS APPLY
        (
        SELECT  TOP 1 WITH TIES
                stuffing
        FROM    [20091201_ties].t_distinct di
        WHERE   di.glow = dd.glow
        ORDER BY
                lorderer
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X2736');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X2736" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>psum</th>
</tr>
<tr>
<td class="int">103</td>
<td class="int">20600</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (0.2499s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 11, logical reads 3506, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 250 ms,  elapsed time = 250 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1015],0), [Expr1005]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
       |--Stream Aggregate(DEFINE:([Expr1015]=Count(*), [Expr1016]=COUNT_BIG([Expr1006]), [Expr1017]=SUM([Expr1006])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
                 |--Stream Aggregate(GROUP BY:([d].[glow]))
                 |    |--Index Scan(OBJECT:([test].[20091201_ties].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
                 |--Top(TOP EXPRESSION:((1)))
                      |--Compute Scalar(DEFINE:([Expr1006]=len([test].[20091201_ties].[t_distinct].[stuffing] as [di].[stuffing])))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[id], [Expr1014]) WITH ORDERED PREFETCH)
                                |--Index Seek(OBJECT:([test].[20091201_ties].[t_distinct].[ix_distinct_glow_lorderer_id] AS [di]), SEEK:([di].[glow]=[test].[20091201_ties].[t_distinct].[glow] as [d].[glow]) ORDERED FORWARD)
                                |--Clustered Index Seek(OBJECT:([test].[20091201_ties].[t_distinct].[PK__t_distinct__7EB7AD3A] AS [di]), SEEK:([di].[id]=[test].[20091201_ties].[t_distinct].[id] as [di].[id]) LOOKUP ORDERED FORWARD)
</pre>
</div>
<p>Now, it&#8217;s a completely different story. With this ordering condition, we have <strong>10</strong> ties per group (instead of <strong>10,000</strong>). The index is a real benefit now, and the query completes in only <strong>250 ms</strong>.</p>
<p>If we replace <code>DISTINCT</code> with a recursive <strong>CTE</strong> (the approach described in the previous article), the query can be improved a little more:</p>
<pre class="brush: sql">
WITH    rows AS
        (
        SELECT  MIN(glow) AS glow
        FROM    [20091201_ties].t_distinct
        UNION ALL
        SELECT  glow
        FROM    (
                SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
                FROM    rows r
                JOIN    [20091130_groupwise].t_distinct di
                ON      di.glow &gt; r.glow
                WHERE   r.glow IS NOT NULL
                ) q
        WHERE   rn = 1
        )
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    rows r
CROSS APPLY
        (
        SELECT  TOP 1 WITH TIES
                stuffing
        FROM    [20091201_ties].t_distinct di
        WHERE   di.glow = r.glow
        ORDER BY
                lorderer
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X8091');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X8091" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>psum</th>
</tr>
<tr>
<td class="int">103</td>
<td class="int">20600</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (0.0051s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 11, logical reads 388, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 62, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_distinct'. Scan count 10, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 5 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1011]=CONVERT_IMPLICIT(int,[Expr1030],0), [Expr1012]=CASE WHEN [Expr1031]=(0) THEN NULL ELSE [Expr1032] END))
       |--Stream Aggregate(DEFINE:([Expr1030]=Count(*), [Expr1031]=COUNT_BIG([Expr1014]), [Expr1032]=SUM([Expr1014])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
                 |--Index Spool(WITH STACK)
                 |    |--Concatenation
                 |         |--Compute Scalar(DEFINE:([Expr1025]=(0)))
                 |         |    |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091201_ties].[t_distinct].[glow])))
                 |         |         |--Top(TOP EXPRESSION:((1)))
                 |         |              |--Index Scan(OBJECT:([test].[20091201_ties].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
                 |         |--Assert(WHERE:(CASE WHEN [Expr1027]&gt;(100) THEN (0) ELSE NULL END))
                 |              |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1027], [Recr1004]))
                 |                   |--Compute Scalar(DEFINE:([Expr1027]=[Expr1026]+(1)))
                 |                   |    |--Table Spool(WITH STACK)
                 |                   |--Filter(WHERE:([Expr1007]=(1)))
                 |                        |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)&lt;(0) THEN (0) ELSE (1) END))
                 |                             |--Sequence Project(DEFINE:([Expr1007]=row_number))
                 |                                  |--Compute Scalar(DEFINE:([Expr1024]=(1)))
                 |                                       |--Segment
                 |                                            |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
                 |                                                 |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] &gt; [Recr1004]) ORDERED FORWARD)
                 |--Top(TOP EXPRESSION:((1)))
                      |--Compute Scalar(DEFINE:([Expr1014]=len([test].[20091201_ties].[t_distinct].[stuffing] as [di].[stuffing])))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[id], [Expr1029]) WITH ORDERED PREFETCH)
                                |--Index Seek(OBJECT:([test].[20091201_ties].[t_distinct].[ix_distinct_glow_lorderer_id] AS [di]), SEEK:([di].[glow]=[Recr1008]) ORDERED FORWARD)
                                |--Clustered Index Seek(OBJECT:([test].[20091201_ties].[t_distinct].[PK__t_distinct__7EB7AD3A] AS [di]), SEEK:([di].[id]=[test].[20091201_ties].[t_distinct].[id] as [di].[id]) LOOKUP ORDERED FORWARD)
</pre>
</div>
<p><strong>5 ms</strong>.</p>
<p>Now that&#8217;s what I call performance.</p>
<h3>Summary</h3>
<hr/>
<p>To retrieve all records holding group-wise minimums or maximums (including ties), two approaches can be used in <strong>SQL Server</strong>:</p>
<ol>
<li>Use <code>DENSE_RANK</code> partitioned by the grouper and return the rows holding the dense rank of <strong>1</strong></li>
<li>Use <code>DISTINCT</code> or a recursive <strong>CTE</strong> to return a list of groupers and <code>CROSS APPLY</code> a query with <code>TOP 1 WITH TIES</code> to each grouper</li>
</ol>
<p>The first method requires sorting on <code>(grouper, orderer)</code>, while the second one requires an index scan (and the implied index lookups) for each grouper.</p>
<p>Sorting large amounts of data is more efficient than retrieving them through index lookups. However, the index scan processes only a portion of the data (the records within the tie), while sorting requires all of it.</p>
<p>For low cardinality orderers (the values inside the <code>MIN</code> and <code>MAX</code>), too many records make up a tie, and the index scan is less efficient than a table scan with sorting. For these orderers, the <code>DENSE_RANK</code> solution is more efficient.</p>
<p>For high cardinality orderers, the index scan returns only a small portion of the records and is almost instant. For these orderers, <code>CROSS APPLY</code> / <code>TOP 1 WITH TIES</code> is more efficient.</p>
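<p>Both approaches can be compared side by side on a tiny data set. This is a minimal sketch on a hypothetical table variable (<code>@t</code> and its columns are made up for illustration), meant to show that the two queries return the same records, not to measure anything:</p>
<pre class="brush: sql">
-- Hypothetical data: group 1 has a tie on the minimal orderer, group 2 does not
DECLARE @t TABLE (id INT NOT NULL PRIMARY KEY, grouper INT NOT NULL, orderer INT NOT NULL)

INSERT
INTO    @t (id, grouper, orderer)
SELECT  1, 1, 5
UNION ALL
SELECT  2, 1, 3
UNION ALL
SELECT  3, 1, 3
UNION ALL
SELECT  4, 2, 7
UNION ALL
SELECT  5, 2, 9

-- Method 1: DENSE_RANK partitioned by the grouper
SELECT  id, grouper, orderer
FROM    (
        SELECT  t.*, DENSE_RANK() OVER (PARTITION BY grouper ORDER BY orderer) AS dr
        FROM    @t t
        ) q
WHERE   dr = 1

-- Method 2: list of groupers, then TOP 1 WITH TIES for each of them
SELECT  ds.id, g.grouper, ds.orderer
FROM    (
        SELECT  DISTINCT grouper
        FROM    @t
        ) g
CROSS APPLY
        (
        SELECT  TOP 1 WITH TIES
                id, orderer
        FROM    @t ti
        WHERE   ti.grouper = g.grouper
        ORDER BY
                orderer
        ) ds
</pre>
<p>Each query returns the records with <code>id</code> <strong>2</strong>, <strong>3</strong> (the tie in group <strong>1</strong>) and <strong>4</strong> (the single minimum in group <strong>2</strong>).</p>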
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2009/12/01/sql-server-selecting-records-holding-group-wise-maximum-with-ties/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SQL Server: Selecting records holding group-wise maximum</title>
		<link>http://explainextended.com/2009/11/30/sql-server-selecting-records-holding-group-wise-maximum/</link>
		<comments>http://explainextended.com/2009/11/30/sql-server-selecting-records-holding-group-wise-maximum/#comments</comments>
		<pubDate>Mon, 30 Nov 2009 20:00:11 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3805</guid>
		<description><![CDATA[Continuing the series on selecting records holding group-wise maximums: How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column? In this article, I&#8217;ll consider SQL Server. SQL Server has very rich SQL syntax and its optimizer is really powerful. However, some tricks are still required [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing the series on <a href="/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/">selecting records holding group-wise maximums</a>:</p>
<blockquote><p>How do I select the <em>whole</em> records, grouped on <code>grouper</code> and holding a group-wise maximum (or minimum) on other column?</p></blockquote>
<p>In this article, I&#8217;ll consider <strong>SQL Server</strong>.</p>
<p><strong>SQL Server</strong> has a very rich <strong>SQL</strong> syntax and its optimizer is really powerful.</p>
<p>However, some tricks are still required to make queries like this run faster.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-3805"></span><br />
<a href="#" onclick="xcollapse('X4790');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X4790" style="display: none; ">
<pre class="brush: sql">
CREATE SCHEMA [20091130_groupwise]
CREATE TABLE t_distinct (
      id INT NOT NULL PRIMARY KEY,
      orderer INT NOT NULL,
      glow INT NOT NULL,
      ghigh INT NOT NULL,
      stuffing VARCHAR(200) NOT NULL
)
GO
CREATE INDEX ix_distinct_glow_id ON [20091130_groupwise].t_distinct (glow, id);
CREATE INDEX ix_distinct_ghigh_id ON [20091130_groupwise].t_distinct (ghigh, id);
CREATE INDEX ix_distinct_glow_orderer_id ON [20091130_groupwise].t_distinct (glow, orderer, id);
CREATE INDEX ix_distinct_ghigh_orderer_id ON [20091130_groupwise].t_distinct (ghigh, orderer, id);
BEGIN TRANSACTION
SELECT  RAND(20091130)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt &lt;= 1000000
BEGIN
        INSERT
        INTO    [20091130_groupwise].t_distinct (id, orderer, glow, ghigh, stuffing)
        VALUES  (
                @cnt,
                FLOOR(RAND() * 9) + 1,
                (@cnt - 1) % 10 + 1,
                (@cnt - 1) % 10000 + 1,
                REPLICATE(&#039;*&#039;, 200)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
</pre>
</div>
<p>As in the previous articles, this table has <strong>1,000,000</strong> records:</p>
<ul>
<li><code>id</code> is the <code>PRIMARY KEY</code></li>
<li><code>orderer</code> is filled with random values from <strong>1</strong> to <strong>9</strong></li>
<li><code>glow</code> is a low cardinality grouping field (<strong>10</strong> distinct values)</li>
<li><code>ghigh</code> is a high cardinality grouping field (<strong>10,000</strong> distinct values)</li>
<li><code>stuffing</code> is an asterisk-filled <code>VARCHAR(200)</code> column added to emulate the payload of real tables</li>
</ul>
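<p>As a quick sanity check of the figures above, the counts can be read back from the table. This is only a sketch that assumes the creation script has completed; the column aliases are made up for illustration:</p>
<pre class="brush: sql">
-- Row count, number of distinct values per grouping column,
-- and the actual range of the random orderer values
SELECT  COUNT(*) AS cnt,
        COUNT(DISTINCT glow) AS glow_values,
        COUNT(DISTINCT ghigh) AS ghigh_values,
        MIN(orderer) AS min_orderer,
        MAX(orderer) AS max_orderer
FROM    [20091130_groupwise].t_distinct
</pre>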
<p>Now, let&#8217;s make a couple of queries.</p>
<h3>Analytic functions</h3>
<p>Just like <strong>PostgreSQL 8.4</strong> that we considered in the <a href="/2009/11/26/postgresql-selecting-records-holding-group-wise-maximum/">previous article</a>, <strong>SQL Server</strong> implements window (analytic) functions.</p>
<p>We can just copy the queries from the previous articles and see the results.</p>
<h4>Unique rows (no ties)</h4>
<pre class="brush: sql">
SELECT  id, orderer, glow, ghigh
FROM    (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn
        FROM    [20091130_groupwise].t_distinct
        ) q
WHERE   rn = 1
</pre>
<p><a href="#" onclick="xcollapse('X2543');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X2543" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">1</td>
<td class="int">1</td>
</tr>
<tr>
<td class="int">2</td>
<td class="int">9</td>
<td class="int">2</td>
<td class="int">2</td>
</tr>
<tr>
<td class="int">3</td>
<td class="int">2</td>
<td class="int">3</td>
<td class="int">3</td>
</tr>
<tr>
<td class="int">4</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">4</td>
</tr>
<tr>
<td class="int">5</td>
<td class="int">9</td>
<td class="int">5</td>
<td class="int">5</td>
</tr>
<tr>
<td class="int">6</td>
<td class="int">8</td>
<td class="int">6</td>
<td class="int">6</td>
</tr>
<tr>
<td class="int">7</td>
<td class="int">5</td>
<td class="int">7</td>
<td class="int">7</td>
</tr>
<tr>
<td class="int">8</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">8</td>
</tr>
<tr>
<td class="int">9</td>
<td class="int">3</td>
<td class="int">9</td>
<td class="int">9</td>
</tr>
<tr>
<td class="int">10</td>
<td class="int">6</td>
<td class="int">10</td>
<td class="int">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0005s (10.3436s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 42, read-ahead reads 23927, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 3593 ms,  elapsed time = 10342 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Filter(WHERE:([Expr1003]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                 |--Sequence Project(DEFINE:([Expr1003]=row_number))
                      |--Compute Scalar(DEFINE:([Expr1005]=(1)))
                           |--Segment
                                |--Parallelism(Gather Streams, ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[id] ASC))
                                     |--Sort(ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[id] ASC))
                                          |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8]))
</pre>
</div>
<h4>Returning all ties</h4>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
        FROM    [20091130_groupwise].t_distinct d
        ) dd
WHERE   dr = 1
</pre>
<p><a href="#" onclick="xcollapse('X3403');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X3403" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>psum</th>
</tr>
<tr>
<td class="int">111013</td>
<td class="int">22202600</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0000s (39.6242s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 188, read-ahead reads 22053, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 6704 ms,  elapsed time = 39616 ms.
</pre>
<pre>
  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1007],0), [Expr1004]=CASE WHEN [globalagg1009]=(0) THEN NULL ELSE [globalagg1011] END))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=COUNT_BIG([Expr1005]), [partialagg1010]=SUM([Expr1005])))
                      |--Filter(WHERE:([Expr1002]=(1)))
                           |--Compute Scalar(DEFINE:([Expr1005]=len([test].[20091130_groupwise].[t_distinct].[stuffing] as [d].[stuffing])))
                                |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                                     |--Sequence Project(DEFINE:([Expr1002]=dense_rank))
                                          |--Segment
                                               |--Segment
                                                    |--Parallelism(Gather Streams, ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
                                                         |--Sort(ORDER BY:([d].[glow] ASC, [d].[orderer] ASC))
                                                              |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [d]))
</pre>
</div>
<h4>Resolving ties by returning the <code>MAX(id)</code></h4>
<pre class="brush: sql">
SELECT  id, orderer, glow, ghigh
FROM    (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY orderer, id DESC) AS rn
        FROM    [20091130_groupwise].t_distinct
        ) dd
WHERE   rn = 1
</pre>
<p><a href="#" onclick="xcollapse('X4575');return false;"><strong>View query details</strong></a><br />
<br/></p>
<div id="X4575" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">999951</td>
<td class="int">1</td>
<td class="int">1</td>
<td class="int">9951</td>
</tr>
<tr>
<td class="int">999962</td>
<td class="int">1</td>
<td class="int">2</td>
<td class="int">9962</td>
</tr>
<tr>
<td class="int">999933</td>
<td class="int">1</td>
<td class="int">3</td>
<td class="int">9933</td>
</tr>
<tr>
<td class="int">999994</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">9994</td>
</tr>
<tr>
<td class="int">999965</td>
<td class="int">1</td>
<td class="int">5</td>
<td class="int">9965</td>
</tr>
<tr>
<td class="int">999906</td>
<td class="int">1</td>
<td class="int">6</td>
<td class="int">9906</td>
</tr>
<tr>
<td class="int">999987</td>
<td class="int">1</td>
<td class="int">7</td>
<td class="int">9987</td>
</tr>
<tr>
<td class="int">999998</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">999969</td>
<td class="int">1</td>
<td class="int">9</td>
<td class="int">9969</td>
</tr>
<tr>
<td class="int">999960</td>
<td class="int">1</td>
<td class="int">10</td>
<td class="int">9960</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0005s (9.1873s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 3, logical reads 33381, physical reads 41, read-ahead reads 22122, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 4780 ms,  elapsed time = 9188 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Filter(WHERE:([Expr1003]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
                 |--Sequence Project(DEFINE:([Expr1003]=row_number))
                      |--Compute Scalar(DEFINE:([Expr1005]=(1)))
                           |--Segment
                                |--Parallelism(Gather Streams, ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[orderer] ASC, [test].[20091130_groupwise].[t_distinct].[id] DESC))
                                     |--Sort(ORDER BY:([test].[20091130_groupwise].[t_distinct].[glow] ASC, [test].[20091130_groupwise].[t_distinct].[orderer] ASC, [test].[20091130_groupwise].[t_distinct].[id] DESC))
                                          |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8]))
</pre>
</div>
<p>Very elegant and <em>very</em> slow, <strong>9</strong> to <strong>40</strong> (!) seconds.</p>
<p>In all cases, <strong>SQL Server</strong> uses a clustered index scan. The table is too large to fit into the cache. Many physical reads (including read-ahead reads) are required by the query.</p>
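<p>The I/O and timing figures in the collapsed details look like the output of <code>SET STATISTICS IO</code> and <code>SET STATISTICS TIME</code>. Assuming that is how they were collected, a comparable measurement could be taken as sketched below. The numbers will differ with hardware and with the state of the cache:</p>
<pre class="brush: sql">
-- Report logical / physical reads and CPU / elapsed time per statement
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
-- The query being measured (here, the first ROW_NUMBER query)
SELECT  id, orderer, glow, ghigh
FROM    (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn
        FROM    [20091130_groupwise].t_distinct
        ) q
WHERE   rn = 1
GO
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
</pre>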
<h3>DISTINCT / CROSS APPLY</h3>
<p>Unlike <strong>PostgreSQL</strong> and <strong>MySQL</strong>, <strong>SQL Server</strong> does not implement any native method for returning a single record out of a group. However, <strong>SQL Server</strong> does implement a very useful clause, <code>CROSS APPLY</code>, which allows taking the records from the main query and <i>applying</i> their results to a subquery, where they can be referenced. Unlike the set-based <code>JOIN</code>, <code>CROSS APPLY</code> is record-based and clearly distinguishes between the leading and the driven tables. A driven subquery can use the values from the leading query not only in the <code>WHERE</code> clause but also in the <code>TOP</code> clause, and can have its own <code>DISTINCT</code> and <code>ORDER BY</code> clauses, etc.</p>
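<p>To illustrate the point about <code>TOP</code>, here is a small sketch against the sample table. The inline list of numbers and the aliases <code>q</code> and <code>n</code> are hypothetical and serve only as a leading rowset:</p>
<pre class="brush: sql">
-- For each n from the leading rowset, take the first n ids
-- of the group with glow = n
SELECT  q.n, ds.id
FROM    (
        SELECT  1 AS n
        UNION ALL
        SELECT  2
        ) q
CROSS APPLY
        (
        -- The value from the leading row is used both in TOP and in WHERE
        SELECT  TOP (q.n) id
        FROM    [20091130_groupwise].t_distinct d
        WHERE   d.glow = q.n
        ORDER BY
                id
        ) ds
ORDER BY
        q.n, ds.id
</pre>
<p>With the sample data this should return one row for <code>n = 1</code> and two rows for <code>n = 2</code>: the driven subquery is evaluated once per leading record.</p>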
<p><code>CROSS APPLY</code> can improve the query used in our task as well. We should just take the list of the distinct groupers and cross apply the query which would return a first record for each grouper (resolving ties as necessary).</p>
<h4>Unique rows (no ties)</h4>
<pre class="brush: sql">
SELECT  ds.*
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091130_groupwise].t_distinct d
        ) dd
CROSS APPLY
        (
        SELECT  TOP 1 id, orderer, glow, ghigh
        FROM    [20091130_groupwise].t_distinct di
        WHERE   di.glow = dd.glow
        ORDER BY
                id DESC
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X4362');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X4362" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">999991</td>
<td class="int">2</td>
<td class="int">1</td>
<td class="int">9991</td>
</tr>
<tr>
<td class="int">999992</td>
<td class="int">3</td>
<td class="int">2</td>
<td class="int">9992</td>
</tr>
<tr>
<td class="int">999993</td>
<td class="int">8</td>
<td class="int">3</td>
<td class="int">9993</td>
</tr>
<tr>
<td class="int">999994</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">9994</td>
</tr>
<tr>
<td class="int">999995</td>
<td class="int">3</td>
<td class="int">5</td>
<td class="int">9995</td>
</tr>
<tr>
<td class="int">999996</td>
<td class="int">9</td>
<td class="int">6</td>
<td class="int">9996</td>
</tr>
<tr>
<td class="int">999997</td>
<td class="int">4</td>
<td class="int">7</td>
<td class="int">9997</td>
</tr>
<tr>
<td class="int">999998</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">999999</td>
<td class="int">9</td>
<td class="int">9</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">1000000</td>
<td class="int">6</td>
<td class="int">10</td>
<td class="int">10000</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0005s (0.2460s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 11, logical reads 3160, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 234 ms,  elapsed time = 246 ms.
</pre>
<pre>
  |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
       |--Stream Aggregate(GROUP BY:([d].[glow]))
       |    |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
       |--Top(TOP EXPRESSION:((1)))
            |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [di]),  WHERE:([test].[20091130_groupwise].[t_distinct].[glow] as [di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]) ORDERED BACKWARD)
</pre>
</div>
<p>This is quite fast, <strong>250 ms</strong>.</p>
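<p>The same pattern can probably be extended to return all ties, as the <code>DENSE_RANK</code> query does, by using <code>TOP 1 WITH TIES</code> in the driven subquery. This variant is only a sketch and was not measured here:</p>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(LEN(stuffing)) AS psum
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091130_groupwise].t_distinct d
        ) dd
CROSS APPLY
        (
        -- WITH TIES keeps every record sharing the lowest orderer
        SELECT  TOP 1 WITH TIES id, orderer, glow, ghigh, stuffing
        FROM    [20091130_groupwise].t_distinct di
        WHERE   di.glow = dd.glow
        ORDER BY
                orderer
        ) ds
</pre>
<p>On the same data, <code>cnt</code> and <code>psum</code> should match those of the <code>DENSE_RANK</code> query, since both select the records holding the lowest <code>orderer</code> within each <code>glow</code>.</p>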
<h4>Resolving ties by returning the <code>MAX(id)</code> within the <code>MIN(orderer)</code></h4>
<p>Since the <code>orderer</code> and the <code>id</code> are ordered differently in this query, a single <code>CROSS APPLY</code> would not be able to use the index efficiently:</p>
<pre class="brush: sql">
SELECT  ds.*
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091130_groupwise].t_distinct d
        ) dd
CROSS APPLY
        (
        SELECT  TOP 1 id, orderer, glow, ghigh
        FROM    [20091130_groupwise].t_distinct di
        WHERE   di.glow = dd.glow
        ORDER BY
                orderer, id DESC
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X4355');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X4355" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">999965</td>
<td class="int">1</td>
<td class="int">5</td>
<td class="int">9965</td>
</tr>
<tr>
<td class="int">999906</td>
<td class="int">1</td>
<td class="int">6</td>
<td class="int">9906</td>
</tr>
<tr>
<td class="int">999987</td>
<td class="int">1</td>
<td class="int">7</td>
<td class="int">9987</td>
</tr>
<tr>
<td class="int">999998</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">999951</td>
<td class="int">1</td>
<td class="int">1</td>
<td class="int">9951</td>
</tr>
<tr>
<td class="int">999962</td>
<td class="int">1</td>
<td class="int">2</td>
<td class="int">9962</td>
</tr>
<tr>
<td class="int">999933</td>
<td class="int">1</td>
<td class="int">3</td>
<td class="int">9933</td>
</tr>
<tr>
<td class="int">999994</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">9994</td>
</tr>
<tr>
<td class="int">999969</td>
<td class="int">1</td>
<td class="int">9</td>
<td class="int">9969</td>
</tr>
<tr>
<td class="int">999960</td>
<td class="int">1</td>
<td class="int">10</td>
<td class="int">9960</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0006s (15.9986s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 4, logical reads 33862, physical reads 42, read-ahead reads 27281, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 10, logical reads 2959407, physical reads 3114, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 6328 ms,  elapsed time = 16010 ms.
</pre>
<pre>
  |--Parallelism(Gather Streams)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
            |--Stream Aggregate(GROUP BY:([d].[glow]))
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([d].[glow]), ORDER BY:([d].[glow] ASC))
            |         |--Stream Aggregate(GROUP BY:([d].[glow]))
            |              |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
            |--Sort(TOP 1, ORDER BY:([di].[orderer] ASC, [di].[id] DESC))
                 |--Index Spool(SEEK:([di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]))
                      |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [di]))
</pre>
</div>
<p>This query has to sort to find the top record. Since there are too many records to sort, it&#8217;s very inefficient, <strong>16</strong> seconds.</p>
<p>To benefit from the index, we should use a <code>CROSS APPLY</code> twice: to select the <code>MIN(orderer)</code> and to select the <code>MAX(id)</code> within this <code>orderer</code>:</p>
<pre class="brush: sql">
SELECT  ds.*
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091130_groupwise].t_distinct d
        ) dd
CROSS APPLY
        (
        SELECT  TOP 1 dss.*
        FROM    (
                SELECT  TOP 1 orderer
                FROM    [20091130_groupwise].t_distinct di
                WHERE   di.glow = dd.glow
                ORDER BY
                        orderer
                ) ddd
        CROSS APPLY
                (
                SELECT  TOP 1 id, orderer, glow, ghigh
                FROM    [20091130_groupwise].t_distinct dii
                WHERE   dii.glow = dd.glow
                        AND dii.orderer = ddd.orderer
                ORDER BY
                        id DESC
                ) dss
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X281');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X281" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">999951</td>
<td class="int">1</td>
<td class="int">1</td>
<td class="int">9951</td>
</tr>
<tr>
<td class="int">999962</td>
<td class="int">1</td>
<td class="int">2</td>
<td class="int">9962</td>
</tr>
<tr>
<td class="int">999933</td>
<td class="int">1</td>
<td class="int">3</td>
<td class="int">9933</td>
</tr>
<tr>
<td class="int">999994</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">9994</td>
</tr>
<tr>
<td class="int">999965</td>
<td class="int">1</td>
<td class="int">5</td>
<td class="int">9965</td>
</tr>
<tr>
<td class="int">999906</td>
<td class="int">1</td>
<td class="int">6</td>
<td class="int">9906</td>
</tr>
<tr>
<td class="int">999987</td>
<td class="int">1</td>
<td class="int">7</td>
<td class="int">9987</td>
</tr>
<tr>
<td class="int">999998</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">999969</td>
<td class="int">1</td>
<td class="int">9</td>
<td class="int">9969</td>
</tr>
<tr>
<td class="int">999960</td>
<td class="int">1</td>
<td class="int">10</td>
<td class="int">9960</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0005s (0.2416s)</td>
</tr>
</table>
</div>
<pre>
Table 't_distinct'. Scan count 21, logical reads 3228, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 235 ms,  elapsed time = 241 ms.
</pre>
<pre>
  |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[glow]))
       |--Stream Aggregate(GROUP BY:([d].[glow]))
       |    |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), ORDERED FORWARD)
       |--Top(TOP EXPRESSION:((1)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[orderer]))
                 |--Top(TOP EXPRESSION:((1)))
                 |    |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow]) ORDERED FORWARD)
                 |--Top(TOP EXPRESSION:((1)))
                      |--Clustered Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [dii]),  WHERE:([test].[20091130_groupwise].[t_distinct].[glow] as [dii].[glow]=[test].[20091130_groupwise].[t_distinct].[glow] as [d].[glow] AND [test].[20091130_groupwise].[t_distinct].[orderer] as [dii].[orderer]=[test].[20091130_groupwise].[t_distinct].[orderer] as [di].[orderer]) ORDERED BACKWARD)
</pre>
</div>
<p>Now, this query is efficient again, only <strong>250 ms</strong>.</p>
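<p>A possible alternative, not measured in this article: <strong>SQL Server</strong> lets each key column of an index be declared <code>ASC</code> or <code>DESC</code>, so an index whose key order matches <code>ORDER BY orderer, id DESC</code> might let the single <code>CROSS APPLY</code> avoid the sort. The index name below is hypothetical, and the resulting plan would have to be checked:</p>
<pre class="brush: sql">
-- Hypothetical extra index: id is stored in descending order within (glow, orderer)
CREATE INDEX ix_distinct_glow_orderer_id_desc
ON [20091130_groupwise].t_distinct (glow, orderer, id DESC);
GO
SELECT  ds.*
FROM    (
        SELECT  DISTINCT glow
        FROM    [20091130_groupwise].t_distinct d
        ) dd
CROSS APPLY
        (
        SELECT  TOP 1 id, orderer, glow, ghigh
        FROM    [20091130_groupwise].t_distinct di
        WHERE   di.glow = dd.glow
        ORDER BY
                orderer, id DESC
        ) ds
</pre>
<p>The trade-off is one more index to maintain on a large table, which the double <code>CROSS APPLY</code> avoids.</p>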
<p>But can we improve the query a little more?</p>
<h3>Recursive CTE&#8217;s</h3>
<p>Just like <strong>PostgreSQL 8.4</strong>, <strong>SQL Server 2005</strong> supports recursive <strong>CTE</strong>&#8217;s. Unfortunately, it does not use a <a href="/2009/11/18/sql-server-are-the-recursive-ctes-really-set-based/">truly set-based approach</a> to implement them. That&#8217;s why it&#8217;s impossible to <a href="/2009/11/26/postgresql-selecting-records-holding-group-wise-maximum/">use a <strong>LIMIT / TOP</strong> clause</a> to build the list of distinct groupers like we did in <strong>PostgreSQL 8.4</strong>.</p>
<p>However, <strong>SQL Server 2005</strong> allows using window functions in the recursive <strong>CTE</strong>&#8217;s, and its optimizer is even smart enough to build an efficient plan for them that would use the <code>TOP</code> access path over the index.</p>
<p>The idea is basically the same as what we did in <strong>PostgreSQL</strong>:</p>
<ol>
<li>Find the lowest grouper in the anchor part of the recursive <strong>CTE</strong></li>
<li>Iteratively find the next groupers in the recursive part</li>
</ol>
<p>The only difference is that we will have to use <code>ROW_NUMBER</code> instead of <code>TOP</code>.</p>
<p>This will jump over the index and find the distinct groupers in no time (of course only if there are not many of them).</p>
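<p>Taken in isolation, the two steps above amount to the following query, which returns only the list of distinct groupers. It is the same <strong>CTE</strong> that the full queries use, shown here without the <code>CROSS APPLY</code> part:</p>
<pre class="brush: sql">
WITH    rows AS
        (
        -- Anchor part: the lowest grouper
        SELECT  MIN(glow) AS glow
        FROM    [20091130_groupwise].t_distinct
        UNION ALL
        -- Recursive part: the next grouper greater than the previous one
        SELECT  glow
        FROM    (
                SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
                FROM    rows r
                JOIN    [20091130_groupwise].t_distinct di
                ON      di.glow &gt; r.glow
                WHERE   r.glow IS NOT NULL
                ) q
        WHERE   rn = 1
        )
SELECT  glow
FROM    rows
OPTION (MAXRECURSION 0)
</pre>
<p>The recursion stops by itself when no greater grouper is found, because the recursive part then returns no rows.</p>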
<p>Here are the queries:</p>
<h4>Unique rows (no ties)</h4>
<pre class="brush: sql">
WITH    rows AS
        (
        SELECT  MIN(glow) AS glow
        FROM    [20091130_groupwise].t_distinct
        UNION ALL
        SELECT  glow
        FROM    (
                SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
                FROM    rows r
                JOIN    [20091130_groupwise].t_distinct di
                ON      di.glow &gt; r.glow
                WHERE   r.glow IS NOT NULL
                ) q
        WHERE   rn = 1
        )
SELECT  ds.*
FROM    rows r
CROSS APPLY
        (
        SELECT  TOP 1 id, orderer, glow, ghigh
        FROM    [20091130_groupwise].t_distinct d
        WHERE   d.glow = r.glow
        ORDER BY
                id DESC
        ) ds
OPTION (MAXRECURSION 0)
</pre>
<p><a href="#" onclick="xcollapse('X4343');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X4343" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">999991</td>
<td class="int">2</td>
<td class="int">1</td>
<td class="int">9991</td>
</tr>
<tr>
<td class="int">999992</td>
<td class="int">3</td>
<td class="int">2</td>
<td class="int">9992</td>
</tr>
<tr>
<td class="int">999993</td>
<td class="int">8</td>
<td class="int">3</td>
<td class="int">9993</td>
</tr>
<tr>
<td class="int">999994</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">9994</td>
</tr>
<tr>
<td class="int">999995</td>
<td class="int">3</td>
<td class="int">5</td>
<td class="int">9995</td>
</tr>
<tr>
<td class="int">999996</td>
<td class="int">9</td>
<td class="int">6</td>
<td class="int">9996</td>
</tr>
<tr>
<td class="int">999997</td>
<td class="int">4</td>
<td class="int">7</td>
<td class="int">9997</td>
</tr>
<tr>
<td class="int">999998</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">999999</td>
<td class="int">9</td>
<td class="int">9</td>
<td class="int">9999</td>
</tr>
<tr>
<td class="int">1000000</td>
<td class="int">6</td>
<td class="int">10</td>
<td class="int">10000</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0005s (0.0031s)</td>
</tr>
</table>
</div>
<pre>
Table 'Worktable'. Scan count 22, logical reads 142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_distinct'. Scan count 21, logical reads 133, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 3 ms.
</pre>
<pre>
  |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
       |--Index Spool(WITH STACK)
       |    |--Concatenation
       |         |--Compute Scalar(DEFINE:([Expr1015]=(0)))
       |         |    |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091130_groupwise].[t_distinct].[glow])))
       |         |         |--Top(TOP EXPRESSION:((1)))
       |         |              |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
       |         |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1017], [Recr1004]))
       |              |--Compute Scalar(DEFINE:([Expr1017]=[Expr1016]+(1)))
       |              |    |--Table Spool(WITH STACK)
       |              |--Filter(WHERE:([Expr1007]=(1)))
       |                   |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)&lt;(0) THEN (0) ELSE (1) END))
       |                        |--Sequence Project(DEFINE:([Expr1007]=row_number))
       |                             |--Compute Scalar(DEFINE:([Expr1014]=(1)))
       |                                  |--Segment
       |                                       |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
       |                                            |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] &gt; [Recr1004]) ORDERED FORWARD)
       |--Index Spool(SEEK:([Recr1008]=[Recr1008]))
            |--Top(TOP EXPRESSION:((1)))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([d].[id], [Expr1019]) WITH ORDERED PREFETCH)
                      |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [d]), SEEK:([d].[glow]=[Recr1008]) ORDERED BACKWARD)
                      |--Clustered Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [d]), SEEK:([d].[id]=[test].[20091130_groupwise].[t_distinct].[id] as [d].[id]) LOOKUP ORDERED FORWARD)
</pre>
</div>
<h4>Resolving ties by returning the <code>MAX(id)</code> within the <code>MIN(orderer)</code></h4>
<pre class="brush: sql">
WITH    rows AS
        (
        SELECT  MIN(glow) AS glow
        FROM    [20091130_groupwise].t_distinct
        UNION ALL
        SELECT  glow
        FROM    (
                SELECT  di.glow, ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
                FROM    rows r
                JOIN    [20091130_groupwise].t_distinct di
                ON      di.glow &gt; r.glow
                WHERE   r.glow IS NOT NULL
                ) q
        WHERE   rn = 1
        )
SELECT  ds.*
FROM    rows r
CROSS APPLY
        (
        SELECT  TOP 1 dss.*
        FROM    (
                SELECT  TOP 1 orderer
                FROM    [20091130_groupwise].t_distinct di
                WHERE   di.glow = r.glow
                ORDER BY
                        orderer
                ) ddd
        CROSS APPLY
                (
                SELECT  TOP 1 id, orderer, glow, ghigh
                FROM    [20091130_groupwise].t_distinct dii
                WHERE   dii.glow = r.glow
                        AND dii.orderer = ddd.orderer
                ORDER BY
                        id DESC
                ) dss
        ) ds
</pre>
<p><a href="#" onclick="xcollapse('X5086');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X5086" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>glow</th>
<th>ghigh</th>
</tr>
<tr>
<td class="int">999951</td>
<td class="int">1</td>
<td class="int">1</td>
<td class="int">9951</td>
</tr>
<tr>
<td class="int">999962</td>
<td class="int">1</td>
<td class="int">2</td>
<td class="int">9962</td>
</tr>
<tr>
<td class="int">999933</td>
<td class="int">1</td>
<td class="int">3</td>
<td class="int">9933</td>
</tr>
<tr>
<td class="int">999994</td>
<td class="int">1</td>
<td class="int">4</td>
<td class="int">9994</td>
</tr>
<tr>
<td class="int">999965</td>
<td class="int">1</td>
<td class="int">5</td>
<td class="int">9965</td>
</tr>
<tr>
<td class="int">999906</td>
<td class="int">1</td>
<td class="int">6</td>
<td class="int">9906</td>
</tr>
<tr>
<td class="int">999987</td>
<td class="int">1</td>
<td class="int">7</td>
<td class="int">9987</td>
</tr>
<tr>
<td class="int">999998</td>
<td class="int">1</td>
<td class="int">8</td>
<td class="int">9998</td>
</tr>
<tr>
<td class="int">999969</td>
<td class="int">1</td>
<td class="int">9</td>
<td class="int">9969</td>
</tr>
<tr>
<td class="int">999960</td>
<td class="int">1</td>
<td class="int">10</td>
<td class="int">9960</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0005s (0.0032s)</td>
</tr>
</table>
</div>
<pre>
Table 'Worktable'. Scan count 22, logical reads 142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_distinct'. Scan count 31, logical reads 172, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 3 ms.
</pre>
<pre>
  |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1008]))
       |--Index Spool(WITH STACK)
       |    |--Concatenation
       |         |--Compute Scalar(DEFINE:([Expr1017]=(0)))
       |         |    |--Stream Aggregate(DEFINE:([Expr1003]=MIN([test].[20091130_groupwise].[t_distinct].[glow])))
       |         |         |--Top(TOP EXPRESSION:((1)))
       |         |              |--Index Scan(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id]), ORDERED FORWARD)
       |         |--Assert(WHERE:(CASE WHEN [Expr1019]&gt;(100) THEN (0) ELSE NULL END))
       |              |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1019], [Recr1004]))
       |                   |--Compute Scalar(DEFINE:([Expr1019]=[Expr1018]+(1)))
       |                   |    |--Table Spool(WITH STACK)
       |                   |--Filter(WHERE:([Expr1007]=(1)))
       |                        |--Top(TOP EXPRESSION:(CASE WHEN (1) IS NULL OR (1)&lt;(0) THEN (0) ELSE (1) END))
       |                             |--Sequence Project(DEFINE:([Expr1007]=row_number))
       |                                  |--Compute Scalar(DEFINE:([Expr1016]=(1)))
       |                                       |--Segment
       |                                            |--Filter(WHERE:(STARTUP EXPR([Recr1004] IS NOT NULL)))
       |                                                 |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_id] AS [di]), SEEK:([di].[glow] &gt; [Recr1004]) ORDERED FORWARD)
       |--Index Spool(SEEK:([Recr1008]=[Recr1008]))
            |--Top(TOP EXPRESSION:((1)))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([di].[orderer]))
                      |--Top(TOP EXPRESSION:((1)))
                      |    |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [di]), SEEK:([di].[glow]=[Recr1008]) ORDERED FORWARD)
                      |--Top(TOP EXPRESSION:((1)))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([dii].[id], [Expr1021]) WITH ORDERED PREFETCH)
                                |--Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[ix_distinct_glow_orderer_id] AS [dii]), SEEK:([dii].[glow]=[Recr1008] AND [dii].[orderer]=[test].[20091130_groupwise].[t_distinct].[orderer] as [di].[orderer]) ORDERED BACKWARD)
                                |--Clustered Index Seek(OBJECT:([test].[20091130_groupwise].[t_distinct].[PK__t_distinct__7CCF64C8] AS [dii]), SEEK:([dii].[id]=[test].[20091130_groupwise].[t_distinct].[id] as [dii].[id]) LOOKUP ORDERED FORWARD)
</pre>
</div>
<p>Both these queries are very complex and scary to look at.</p>
<p>However, they complete in only <strong>3 ms</strong>. This is <strong>3,000</strong> (!) times as efficient as a plain <code>ROW_NUMBER</code> query, though the latter is much simpler to understand.</p>
<h3>Summary</h3>
<p>To select records holding group-wise maximums in <strong>SQL Server</strong>, the following approaches can be used:</p>
<ul>
<li>Analytic functions</li>
<li>Using <code>DISTINCT</code> / <code>CROSS APPLY</code> / <code>TOP</code></li>
<li>Using recursive <strong>CTE</strong>s / <code>CROSS APPLY</code> / <code>TOP</code></li>
</ul>
<p>These methods are arranged in order of increasing complexity and decreasing cost (for low-cardinality groupers).</p>
<p>If there are few groupers in the table, the index scan (used by <code>DISTINCT</code>) or a loose index scan (emulated by the recursive <strong>CTE</strong>s) builds the list of distinct groupers much faster, and the <code>CROSS APPLY</code> then uses the secondary indexes to select the first record within each group.</p>
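<p>As a reminder of the overall shape, here is a simplified sketch of the last approach. It is not the exact query benchmarked above: the table and column names (<code>t_distinct</code>, <code>glow</code>, <code>orderer</code>, <code>id</code>) are taken from the execution plan, while the CTE name <code>groupers</code> and the <code>ORDER BY</code> inside the <code>CROSS APPLY</code> are assumptions made for illustration.</p>
<pre>
WITH    groupers (glow) AS
        (
        -- Anchor: the smallest grouper, read from the start of the index on glow
        SELECT  MIN(glow)
        FROM    [20091130_groupwise].t_distinct
        UNION ALL
        -- Recursive step: jump to the next distinct grouper (the loose index scan).
        -- TOP and aggregates are not allowed here, so ROW_NUMBER picks the first row.
        SELECT  glow
        FROM    (
                SELECT  di.glow,
                        ROW_NUMBER() OVER (ORDER BY di.glow) AS rn
                FROM    groupers g
                JOIN    [20091130_groupwise].t_distinct di
                ON      di.glow &gt; g.glow
                ) q
        WHERE   rn = 1
        )
SELECT  t.*
FROM    groupers g
CROSS APPLY
        (
        -- First record within the group (assumed ordering: orderer, then id)
        SELECT  TOP 1 *
        FROM    [20091130_groupwise].t_distinct di
        WHERE   di.glow = g.glow
        ORDER BY
                di.orderer, di.id
        ) t
</pre>
<p>The recursion runs once per distinct grouper, so with more than 100 groupers the default recursion limit would have to be raised with <code>OPTION (MAXRECURSION 0)</code>. With that many groups, though, a plain <code>ROW_NUMBER</code> query is likely to be the better choice anyway.</p>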
<p><strong>To be continued.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2009/11/30/sql-server-selecting-records-holding-group-wise-maximum/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

