EXPLAIN EXTENDED

How to create fast database queries

SQL Server: EXCEPT ALL

with 2 comments

Answering questions asked on the site.

myst asks:

I have two really large tables with lots of columns, many of them are nullable

I need to remove all rows from table1 which are not present in table2, but there can be duplicates in both tables and writing more than 70 IS NULL conditions would be a pain, and I want to make sure there's nothing I'm missing.

Is there a more simple way?

SQL Server supports EXCEPT clause which returns all records present in the first table and absent in the second one. But this clause eliminates duplicates and cannot be used as a subject to a DML operation.

ANSI SQL standard describes EXCEPT ALL which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, SQL Server does not support this operator.

Similar behavior can be achieved using NOT IN or NOT EXISTS constructs. But in SQL Server, IN predicate does not accept more than one field. NOT EXISTS accepts any number of correlated columns, but it requires extra checks in the WHERE clause, since equality operator does not treat two NULL values as equal. Each pair or nullable columns should be additionally checked for a NULL in both fields. This can only be done using OR predicates or COALESCE, neither of which adds to performance.

But there is a way to emulate EXCEPT ALL in SQL Server quite elegantly and efficiently.

Let's create two sample tables:

Table creation details

The second table is 20 records short of being the full copy of the first table. There are NULL values and duplicates in both tables.

The EXCEPT query returns this:

SELECT  *
FROM    [20100211_except].t1 AS t1
EXCEPT
SELECT  *
FROM    [20100211_except].t2 AS t2
val1 val2 val3 val4 val5
9999 9999 9999 9999
9998 9998 9998 9998 9998
9998 9998 9998 9998
9999 9999 9999 9999 9999
4 rows fetched in 0.0011s (0.1933s)

View query details

There are only 4 records in the resultset, since the duplicates are eliminated. And you cannot delete from this resultset.

Here's the EXISTS workaround:

SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    [20100211_except].t2 AS t2
WHERE   1 = 1
AND (t1.val1 = t2.val1 OR (t1.val1 IS NULL AND t2.val1 IS NULL))
AND (t1.val2 = t2.val2 OR (t1.val2 IS NULL AND t2.val2 IS NULL))
AND (t1.val3 = t2.val3 OR (t1.val3 IS NULL AND t2.val3 IS NULL))
AND (t1.val4 = t2.val4 OR (t1.val4 IS NULL AND t2.val4 IS NULL))
AND (t1.val5 = t2.val5 OR (t1.val5 IS NULL AND t2.val5 IS NULL))
)
val1 val2 val3 val4 val5
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999
20 rows fetched in 0.0029s (0.1953s)

View query details

This returns all records, but you just look at that query. And it's only 5 columns, not 70.

To emulate EXCEPT ALL, we can use a little trick.

Unlike PostgreSQL, SQL Server does not support record types. You cannot return a whole record in one field. But it is possible to return a whole record in a tableless SELECT in the correlated subquery. And plain EXCEPT operator can be applied to this SELECT.

Using this trick, EXCEPT may be applied to each individual record from the first table, not to the table as a whole. If this record is presented among the other table's records, nothing will be returned by the operator; otherwise a single row which was initially supplied will be returned. This can be filtered using EXISTS clause.

Here's how it looks:

SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
(
SELECT  t1.*
EXCEPT
SELECT  *
FROM    [20100211_except].t2 AS t2
)
val1 val2 val3 val4 val5
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9999 9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999
20 rows fetched in 0.0033s (0.7656s)

View query details

All 20 records are on their places.

Note that this query didn't use any actual field names at all: only pure set-based operators.

This is a little bit less efficient than a normal field-based EXISTS, because this query is forced to use Nested Loops. However, with an index that is built in-place by the Eager Spool operation, this is still quite fast.

This query behaves just like EXCEPT ALL but tecnhically it's a single-table SELECT query without joins or aggregations. And as such, it can be used in the DML.

Use this query to delete all records absent in the second table:

DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   EXISTS
(
SELECT  t1.*
EXCEPT
SELECT  *
FROM    [20100211_except].t2
)

, and this one to delete records present in the second table:

DELETE  t1
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
(
SELECT  t1.*
EXCEPT
SELECT  *
FROM    [20100211_except].t2
)

Beware: EXISTS predicate returns absent records, and NOT EXISTS returns present records.

This is not very intuitive, but may become more clear if you realize that the subquery effectively checks for absence of the current row, so the EXISTS predicate in fact says where exists the absence of the current row.

Hope that helps.

Update of Feb 12th, 2010

In a feedback, Gonçalo Ferreira suggested using INTERSECT instead of EXCEPT in the inner subquery:

Wouldn't using INTERSECT in the sub-query avoid the EXISTS - absent / NOT EXISTS - present confusion?

Let's try it:

SELECT  *
FROM    [20100211_except].t1 AS t1
WHERE   NOT EXISTS
(
SELECT  t1.*
INTERSECT
SELECT  *
FROM    [20100211_except].t2 AS t2
)
val1 val2 val3 val4 val5
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9999 9999 9999 9999 9999
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9998 9998 9998 9998 9998
9998 9998 9998 9998
9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999
9999 9999 9999 9999
20 rows fetched in 0.0027s (0.1754s)

View query details

Not only this is more readable, but this is also much faster, since it uses a single Hash Match (Right Anti Semi Join) (which is the same thing as the equality EXISTS clause used).

Nice point, Gonçalo!


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 10th, 2010 at 11:00 pm

Posted in SQL Server

2 Responses to 'SQL Server: EXCEPT ALL'

Subscribe to comments with RSS

  1. This isn’t the “except all” semantics. Except all requires if table 1 has 5 “1”s and table 2 has 3 “1”s, table 1 except all table 2 returns 2 rows. Your solution returns no row.

    Jian

    18 Nov 15 at 09:35

  2. @Jian: thanks, you’re right. Will update the post.

    Quassnoi

    18 Nov 15 at 15:28

Leave a Reply