SQL Server: EXCEPT ALL at EXPLAIN EXTENDED

SQL Server: EXCEPT ALL

Answering questions asked on the site.

myst asks:

I have two really large tables with lots of columns, many of them are nullable

I need to remove all rows from table1 which are not present in table2, but there can be duplicates in both tables and writing more than 70 IS NULL conditions would be a pain, and I want to make sure there's nothing I'm missing.

Is there a more simple way?

SQL Server supports EXCEPT clause which returns all records present in the first table and absent in the second one. But this clause eliminates duplicates and cannot be used as a subject to a DML operation.

ANSI SQL standard describes EXCEPT ALL which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, SQL Server does not support this operator.

Similar behavior can be achieved using NOT IN or NOT EXISTS constructs. But in SQL Server, IN predicate does not accept more than one field. NOT EXISTS accepts any number of correlated columns, but it requires extra checks in the WHERE clause, since equality operator does not treat two NULL values as equal. Each pair or nullable columns should be additionally checked for a NULL in both fields. This can only be done using OR predicates or COALESCE, neither of which adds to performance.

But there is a way to emulate EXCEPT ALL in SQL Server quite elegantly and efficiently.

Let's create two sample tables:

Table creation details

CREATE SCHEMA [20100211_except]
CREATE TABLE t1 (
val1 INT,
val2 INT,
val3 INT,
val4 INT,
val5 INT
)
CREATE TABLE t2 (
val1 INT,
val2 INT,
val3 INT,
val4 INT,
val5 INT
)
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 0
WHILE @cnt &lt; 100000
BEGIN
        INSERT
        INTO    [20100211_except].t1
        VALUES  (
                FLOOR(@cnt / 10),
                FLOOR(@cnt / 10),
                FLOOR(@cnt / 10),
                FLOOR(@cnt / 10),
                CASE WHEN @cnt % 2 = 0 THEN FLOOR(@cnt / 10) ELSE NULL END
                )
        SET @cnt = @cnt + 1
END
INSERT
INTO    [20100211_except].t2
SELECT  TOP 99980 *
FROM    [20100211_except].t1
ORDER BY
        val1, val2, val3, val4, val5
COMMIT
GO

The second table is 20 records short of being the full copy of the first table. There are NULL values and duplicates in both tables.

The EXCEPT query returns this:

SELECT  *
FROM    [20100211_except].t1 AS t1
EXCEPT
SELECT  *
FROM    [20100211_except].t2 AS t2

val1	val2	val3	val4	val5
9999	9999	9999	9999
9998	9998	9998	9998	9998
9998	9998	9998	9998
9999	9999	9999	9999	9999
4 rows fetched in 0.0011s (0.1933s)

View query details

Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 376 ms,  elapsed time = 190 ms.

  |--Parallelism(Gather Streams)
       |--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
            |    |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
            |--Hash Match(Aggregate, HASH:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t1].[val1] as [t1].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t1].[val2] as [t1].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t1].[val3] as [t1].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t1].[val4] as [t1].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t1].[val5] as [t1].[val5]))
                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
                      |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))

There are only 4 records in the resultset, since the duplicates are eliminated. And you cannot delete from this resultset.

Here's the EXISTS workaround:

SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    [20100211_except].t2 AS t2
WHERE   1 = 1
AND (t1.val1 = t2.val1 OR (t1.val1 IS NULL AND t2.val1 IS NULL))
AND (t1.val2 = t2.val2 OR (t1.val2 IS NULL AND t2.val2 IS NULL))
AND (t1.val3 = t2.val3 OR (t1.val3 IS NULL AND t2.val3 IS NULL))
AND (t1.val4 = t2.val4 OR (t1.val4 IS NULL AND t2.val4 IS NULL))
AND (t1.val5 = t2.val5 OR (t1.val5 IS NULL AND t2.val5 IS NULL))
)

val1	val2	val3	val4	val5
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999
20 rows fetched in 0.0029s (0.1953s)

View query details

Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 344 ms,  elapsed time = 194 ms.

  |--Parallelism(Gather Streams)
       |--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
            |    |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
                 |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))

This returns all records, but you just look at that query. And it's only 5 columns, not 70.

To emulate EXCEPT ALL, we can use a little trick.

Unlike PostgreSQL, SQL Server does not support record types. You cannot return a whole record in one field. But it is possible to return a whole record in a tableless SELECT in the correlated subquery. And plain EXCEPT operator can be applied to this SELECT.

Using this trick, EXCEPT may be applied to each individual record from the first table, not to the table as a whole. If this record is presented among the other table's records, nothing will be returned by the operator; otherwise a single row which was initially supplied will be returned. This can be filtered using EXISTS clause.

Here's how it looks:

SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
(
SELECT  t1.*
EXCEPT
SELECT  *
FROM    [20100211_except].t2 AS t2
)

val1	val2	val3	val4	val5
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9999	9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999
20 rows fetched in 0.0033s (0.7656s)

View query details

Table 'Worktable'. Scan count 100000, logical reads 555796, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't2'. Scan count 1, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 984 ms,  elapsed time = 758 ms.

  |--Parallelism(Gather Streams)
       |--Nested Loops(Left Semi Join, OUTER REFERENCES:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
            |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
            |--Nested Loops(Left Anti Semi Join)
                 |--Constant Scan
                 |--Index Spool(SEEK:([t2].[val1]=[test].[20100211_except].[t1].[val1] as [t1].[val1] AND [t2].[val2]=[test].[20100211_except].[t1].[val2] as [t1].[val2] AND [t2].[val3]=[test].[20100211_except].[t1].[val3] as [t1].[val3] AND [t2].[val4]=[test].[20100211_except].[t1].[val4] as [t1].[val4] AND [t2].[val5]=[test].[20100211_except].[t1].[val5] as [t1].[val5]))
                      |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))

All 20 records are on their places.

Note that this query didn't use any actual field names at all: only pure set-based operators.

This is a little bit less efficient than a normal field-based EXISTS, because this query is forced to use Nested Loops. However, with an index that is built in-place by the Eager Spool operation, this is still quite fast.

This query behaves just like EXCEPT ALL but tecnhically it's a single-table SELECT query without joins or aggregations. And as such, it can be used in the DML.

Use this query to delete all records absent in the second table:

DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
(
SELECT  t1.*
EXCEPT
SELECT  *
FROM    [20100211_except].t2
)

, and this one to delete records present in the second table:

DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
(
SELECT  t1.*
EXCEPT
SELECT  *
FROM    [20100211_except].t2
)

Beware: EXISTS predicate returns absent records, and NOT EXISTS returns present records.

This is not very intuitive, but may become more clear if you realize that the subquery effectively checks for absence of the current row, so the EXISTS predicate in fact says where exists the absence of the current row.

Hope that helps.

Update of Feb 12th, 2010

In a feedback, Gonçalo Ferreira suggested using INTERSECT instead of EXCEPT in the inner subquery:

Wouldn't using INTERSECT in the sub-query avoid the EXISTS - absent / NOT EXISTS - present confusion?

Let's try it:

SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
(
SELECT  t1.*
INTERSECT
SELECT  *
FROM    [20100211_except].t2 AS t2
)

val1	val2	val3	val4	val5
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9999	9999	9999	9999	9999
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9998	9998	9998	9998	9998
9998	9998	9998	9998
9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999
9999	9999	9999	9999
20 rows fetched in 0.0027s (0.1754s)

View query details

Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 343 ms,  elapsed time = 174 ms.

  |--Parallelism(Gather Streams)
       |--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
            |    |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
                 |--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))

Not only this is more readable, but this is also much faster, since it uses a single Hash Match (Right Anti Semi Join) (which is the same thing as the equality EXISTS clause used).

Nice point, Gonçalo!

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 10th, 2010 at 11:00 pm

Posted in SQL Server

2 Responses to 'SQL Server: EXCEPT ALL'

Subscribe to comments with RSS

This isn’t the “except all” semantics. Except all requires if table 1 has 5 “1”s and table 2 has 3 “1”s, table 1 except all table 2 returns 2 rows. Your solution returns no row.

Jian

18 Nov 15 at 09:35
@Jian: thanks, you’re right. Will update the post.

Quassnoi

18 Nov 15 at 15:28

EXPLAIN EXTENDED