Counting missing rows: SQL Server
Comments enabled. I *really* need your comment
This is the 3rd of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Now, let's see how the same predicate is implemented in SQL Server.
We'll create sample tables just like those we used yesterday:
Table creation details
Now, let's run the original query:
SELECT COUNT(*) AS cnt FROM [20090420_in].Table1 t1 WHERE id NOT IN ( SELECT id FROM [20090420_in].Table2 )
cnt | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.2945s) |
Table 'Table2'. Scan count 3, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Table1'. Scan count 3, logical reads 1496, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 421 ms, elapsed time = 294 ms.
|--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[globalagg1007],0))) |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]))) |--Parallelism(Gather Streams) |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*))) |--Hash Match(Right Anti Semi Join, HASH:([test].[20090420_in].[Table2].[id])=([t1].[id])) |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20090420_in].[Table2].[id])) | |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67])) |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[id])) |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id] AS [t1]))
This query completes quite fast.
If we look into the execution plan for this query will will see that the major join operation is performed with a RIGHT MERGE ANTI SEMI JOIN
.
This algorithm scans both table1
and table2
simultaneously, using the clustered data order. It uses a pointer variable showing the place in the index the engine is scanning at now.
On each iteration, the engline finds which table has the next
value (i. e. the minimal value that is greater that the current value of the pointer) and advances the pointer to this next
value.
If (and only if) this next
value is present in table1
but not in table2
, then the resulting COUNT(*)
is incremented.
This is quite an efficient algorithm and it completes in only 294 ms (unlike 4.64 seconds with MySQL).
If we search for the absent grouper
's instead of id
's, it completes even sooner:
SELECT COUNT(*) FROM [20090420_in].Table1 t1 WHERE grouper NOT IN ( SELECT id FROM [20090420_in].Table2 )
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.2257s) |
Table 'Table1'. Scan count 1, logical reads 1365, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Table2'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 219 ms, elapsed time = 225 ms.
|--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1008],0))) |--Stream Aggregate(DEFINE:([Expr1008]=Count(*))) |--Merge Join(Right Anti Semi Join, MERGE:([test].[20090420_in].[Table2].[id])=([t1].[grouper]), RESIDUAL:([test].[20090420_in].[Table1].[grouper] as [t1].[grouper]=[test].[20090420_in].[Table2].[id])) |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67]), ORDERED FORWARD) |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id] AS [t1]), ORDERED FORWARD)
Now, let's see if we can benefit from the tricks we used in MySQL.
These tricks include reverse COUNT(*)
and joining on DISTINCT
.
In reverse COUNT(*)
, we COUNT(*)
the values that are IN
the subquery, and then subtract the result from the total COUNT(*)
:
SELECT ( SELECT COUNT(*) FROM [20090420_in].Table1 ) - ( SELECT COUNT(*) FROM [20090420_in].Table1 WHERE id IN ( SELECT id FROM [20090420_in].Table2 ) )
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.1239s) |
Table 'Table1'. Scan count 1, logical reads 4436, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Table2'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 125 ms, elapsed time = 124 ms.
|--Compute Scalar(DEFINE:([Expr1013]=[Expr1003]-[Expr1011])) |--Nested Loops(Inner Join) |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1018],0))) | |--Stream Aggregate(DEFINE:([Expr1018]=Count(*))) | |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id])) |--Compute Scalar(DEFINE:([Expr1011]=CONVERT_IMPLICIT(int,[Expr1020],0))) |--Stream Aggregate(DEFINE:([Expr1020]=Count(*))) |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20090420_in].[Table2].[id], [Expr1019]) OPTIMIZED WITH UNORDERED PREFETCH) |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67])) |--Clustered Index Seek(OBJECT:([test].[20090420_in].[Table1].[PK__Table1__69C6B1F5]), SEEK:([test].[20090420_in].[Table1].[id]=[test].[20090420_in].[Table2].[id]) ORDERED FORWARD)
We can see an apparent benefit from using the COUNT(*)
, despite the fact that SQL Server, unlike MyISAM, does not keep the COUNT(*)
in metadata.
This is because the original query (using RIGHT MERGE ANTI SEMI JOIN
) could not benefit from parallelization, while this one (which is in fact a COUNT(*)
and an INNER JOIN
) can.
Execution plan shows parallelization on both these operations.
Note that CPU time
is twise the execution time
: on my dual-core machine, two processors work in parallel.
Now, what about DISTINCT
?
SELECT COUNT(*) FROM ( SELECT DISTINCT grouper FROM [20090420_in].Table1 ) d JOIN [20090420_in].Table1 t1 ON t1.grouper = d.grouper WHERE d.grouper NOT IN ( SELECT id FROM [20090420_in].Table2 )
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.2369s) |
Table 'Table1'. Scan count 2, logical reads 1373, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Table2'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 235 ms, elapsed time = 237 ms.
|--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0))) |--Stream Aggregate(DEFINE:([Expr1012]=Count(*))) |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20090420_in].[Table1].[grouper], [Expr1011]) OPTIMIZED WITH UNORDERED PREFETCH) |--Merge Join(Left Anti Semi Join, MERGE:([test].[20090420_in].[Table1].[grouper])=([test].[20090420_in].[Table2].[id]), RESIDUAL:([test].[20090420_in].[Table1].[grouper]=[test].[20090420_in].[Table2].[id])) | |--Stream Aggregate(GROUP BY:([test].[20090420_in].[Table1].[grouper])) | | |--Index Scan(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id]), ORDERED FORWARD) | |--Clustered Index Scan(OBJECT:([test].[20090420_in].[Table2].[PK__Table2__6BAEFA67]), ORDERED FORWARD) |--Index Seek(OBJECT:([test].[20090420_in].[Table1].[IX_Table1_grouper_id] AS [t1]), SEEK:([t1].[grouper]=[test].[20090420_in].[Table1].[grouper]) ORDERED FORWARD)
In this case, there is hardly a performance benefit, even with a highly selective index on grouper.
This is because the original query plan, unlike MySQL, is in fact quite optimal, and is even close to this one. This query benefits from using a RIGHT MERGE ANTI SEMI JOIN
on a smaller dataset (DISTINCT grouper
), but loses from using an additional join with Table1
on CLUSTERED INDEX
.
Summary
SQL Server uses RIGHT MERGE ANTI SEMI JOIN
for NOT IN
queries, which is quite efficient by itself and cannot be improved by selecting DISTINCT
values from the table searched for missing rows.
However, if we need to COUNT(*)
the values, SQL Server
can benefit from reversing the conditions and substracting the COUNT(*)
of matching rows from the total COUNT(*)
.
Both these operations can be parallelized and run more efficiently on multiprocessor systems.
To be continued.