Counting missing rows: SQL Server at EXPLAIN EXTENDED

Counting missing rows: SQL Server

Comments enabled. I *really* need your comment

This is the 3^rd of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

NOT IN in MySQL
NOT IN in MySQL (using DISTINCT)
NOT IN in SQL Server
NOT IN in Oracle
NOT IN in PostgreSQL

Now, let's see how the same predicate is implemented in SQL Server.

We'll create sample tables just like those we used yesterday:

Table creation details

SET NOCOUNT ON
GO
DROP TABLE [20090420_in].table1
DROP TABLE [20090420_in].table2
DROP SCHEMA [20090420_in]
GO
CREATE SCHEMA [20090420_in]
CREATE TABLE [20090420_in].Table1 (
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
value VARCHAR(50) NOT NULL,
)
CREATE TABLE [20090420_in].Table2 (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)
GO
CREATE INDEX IX_Table1_grouper ON [20090420_in].Table1(grouper)
GO
DECLARE @cnt INT;
SET @cnt = 1;
BEGIN TRANSACTION
WHILE @cnt &lt;= 1000000
BEGIN
        INSERT
        INTO    [20090420_in].Table1 (id, grouper, value)
        VALUES  (@cnt, @cnt % 1000, 'Value One' + CAST(@cnt AS VARCHAR))
        SET @cnt = @cnt + 1
END

INSERT
INTO    [20090420_in].table2 (id, value)
SELECT  TOP 1000
        id, 'Value Two ' + CAST(id AS VARCHAR)
FROM    [20090420_in].table1

COMMIT
GO

Now, let's run the original query:

SELECT  COUNT(*) AS cnt
FROM    [20090420_in].Table1 t1
WHERE   id NOT IN
(
SELECT  id
FROM    [20090420_in].Table2
)

cnt
999000
1 row fetched in 0.0002s (0.2945s)

Table 'Table2'. Scan count 3, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Table1'. Scan count 3, logical reads 1496, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 421 ms,  elapsed time = 294 ms.

  |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[globalagg1007],0)))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*)))
                      |--Hash Match(Right Anti Semi Join, HASH:([test].[20090420_in].[Table2].[id])=([t1].[id]))
                           |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090420_in].[Table2].[id]))
                           |    |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67]))
                           |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[id]))
                                |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id] AS [t1]))

This query completes quite fast.

If we look into the execution plan for this query will will see that the major join operation is performed with a RIGHT MERGE ANTI SEMI JOIN.

This algorithm scans both table1 and table2 simultaneously, using the clustered data order. It uses a pointer variable showing the place in the index the engine is scanning at now.

On each iteration, the engline finds which table has the next value (i. e. the minimal value that is greater that the current value of the pointer) and advances the pointer to this next value.

If (and only if) this next value is present in table1 but not in table2, then the resulting COUNT(*) is incremented.

This is quite an efficient algorithm and it completes in only 294 ms (unlike 4.64 seconds with MySQL).

If we search for the absent grouper's instead of id's, it completes even sooner:

SELECT  COUNT(*)
FROM    [20090420_in].Table1 t1
WHERE   grouper NOT IN
(
SELECT  id
FROM    [20090420_in].Table2
)


1000
1 row fetched in 0.0002s (0.2257s)

Table 'Table1'. Scan count 1, logical reads 1365, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Table2'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 219 ms,  elapsed time = 225 ms.

  |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1008],0)))
       |--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
            |--Merge Join(Right Anti Semi Join, MERGE:([test].[20090420_in].[Table2].[id])=([t1].[grouper]), RESIDUAL:([test].[20090420_in].[Table1].[grouper] as [t1].[grouper]=[test].[20090420_in].[Table2].[id]))
                 |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67]), ORDERED FORWARD)
                 |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id] AS [t1]), ORDERED FORWARD)

Now, let's see if we can benefit from the tricks we used in MySQL.

These tricks include reverse COUNT(*) and joining on DISTINCT.

In reverse COUNT(*), we COUNT(*) the values that are IN the subquery, and then subtract the result from the total COUNT(*):

SELECT  (
SELECT  COUNT(*)
FROM    [20090420_in].Table1
) -
(
SELECT  COUNT(*)
FROM    [20090420_in].Table1
WHERE   id IN
(
SELECT  id
FROM    [20090420_in].Table2
)
)


999000
1 row fetched in 0.0002s (0.1239s)

Table 'Table1'. Scan count 1, logical reads 4436, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Table2'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 125 ms,  elapsed time = 124 ms.

  |--Compute Scalar(DEFINE:([Expr1013]=[Expr1003]-[Expr1011]))
       |--Nested Loops(Inner Join)
            |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1018],0)))
            |    |--Stream Aggregate(DEFINE:([Expr1018]=Count(*)))
            |         |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id]))
            |--Compute Scalar(DEFINE:([Expr1011]=CONVERT_IMPLICIT(int,[Expr1020],0)))
                 |--Stream Aggregate(DEFINE:([Expr1020]=Count(*)))
                      |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20090420_in].[Table2].[id], [Expr1019]) OPTIMIZED WITH UNORDERED PREFETCH)
                           |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67]))
                           |--Clustered Index Seek(OBJECT:([test].[20090420_in].[Table1].[PK__Table1__69C6B1F5]), SEEK:([test].[20090420_in].[Table1].[id]=[test].[20090420_in].[Table2].[id]) ORDERED FORWARD)

We can see an apparent benefit from using the COUNT(*), despite the fact that SQL Server, unlike MyISAM, does not keep the COUNT(*) in metadata.

This is because the original query (using RIGHT MERGE ANTI SEMI JOIN) could not benefit from parallelization, while this one (which is in fact a COUNT(*) and an INNER JOIN) can.

Execution plan shows parallelization on both these operations.

Note that CPU time is twise the execution time: on my dual-core machine, two processors work in parallel.

Now, what about DISTINCT?

SELECT  COUNT(*)
FROM    (
SELECT  DISTINCT grouper
FROM    [20090420_in].Table1
) d
JOIN    [20090420_in].Table1 t1
ON      t1.grouper = d.grouper
WHERE   d.grouper NOT IN
(
SELECT  id
FROM    [20090420_in].Table2
)


1000
1 row fetched in 0.0002s (0.2369s)

Table 'Table1'. Scan count 2, logical reads 1373, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Table2'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 235 ms,  elapsed time = 237 ms.

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20090420_in].[Table1].[grouper], [Expr1011]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Merge Join(Left Anti Semi Join, MERGE:([test].[20090420_in].[Table1].[grouper])=([test].[20090420_in].[Table2].[id]), RESIDUAL:([test].[20090420_in].[Table1].[grouper]=[test].[20090420_in].[Table2].[id]))
                 |    |--Stream Aggregate(GROUP BY:([test].[20090420_in].[Table1].[grouper]))
                 |    |    |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id]), ORDERED FORWARD)
                 |    |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67]), ORDERED FORWARD)
                 |--Index Seek(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id] AS [t1]), SEEK:([t1].[grouper]=[test].[20090420_in].[Table1].[grouper]) ORDERED FORWARD)

In this case, there is hardly a performance benefit, even with a highly selective index on grouper.

This is because the original query plan, unlike MySQL, is in fact quite optimal, and is even close to this one. This query benefits from using a RIGHT MERGE ANTI SEMI JOIN on a smaller dataset (DISTINCT grouper), but loses from using an additional join with Table1 on CLUSTERED INDEX.

Summary

SQL Server uses RIGHT MERGE ANTI SEMI JOIN for NOT IN queries, which is quite efficient by itself and cannot be improved by selecting DISTINCT values from the table searched for missing rows.

However, if we need to COUNT(*) the values, SQL Server can benefit from reversing the conditions and substracting the COUNT(*) of matching rows from the total COUNT(*).

Both these operations can be parallelized and run more efficiently on multiprocessor systems.

To be continued.

Written by Quassnoi

April 20th, 2009 at 11:00 pm

Posted in SQL Server

EXPLAIN EXTENDED