Answering questions asked on the site.
myst asks:
I have two really large tables with lots of columns, many of them are nullable
I need to remove all rows from table1
which are not present in table2
, but there can be duplicates in both tables and writing more than 70 IS NULL
conditions would be a pain, and I want to make sure there's nothing I'm missing.
Is there a more simple way?
SQL Server supports EXCEPT
clause which returns all records present in the first table and absent in the second one. But this clause eliminates duplicates and cannot be used as a subject to a DML
operation.
ANSI SQL standard describes EXCEPT ALL
which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, SQL Server does not support this operator.
Similar behavior can be achieved using NOT IN
or NOT EXISTS
constructs. But in SQL Server, IN
predicate does not accept more than one field. NOT EXISTS
accepts any number of correlated columns, but it requires extra checks in the WHERE
clause, since equality operator does not treat two NULL values as equal. Each pair or nullable columns should be additionally checked for a NULL in both fields. This can only be done using OR
predicates or COALESCE
, neither of which adds to performance.
But there is a way to emulate EXCEPT ALL
in SQL Server quite elegantly and efficiently.
Let's create two sample tables:
Table creation details
CREATE SCHEMA [20100211_except]
CREATE TABLE t1 (
val1 INT,
val2 INT,
val3 INT,
val4 INT,
val5 INT
)
CREATE TABLE t2 (
val1 INT,
val2 INT,
val3 INT,
val4 INT,
val5 INT
)
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 0
WHILE @cnt < 100000
BEGIN
INSERT
INTO [20100211_except].t1
VALUES (
FLOOR(@cnt / 10),
FLOOR(@cnt / 10),
FLOOR(@cnt / 10),
FLOOR(@cnt / 10),
CASE WHEN @cnt % 2 = 0 THEN FLOOR(@cnt / 10) ELSE NULL END
)
SET @cnt = @cnt + 1
END
INSERT
INTO [20100211_except].t2
SELECT TOP 99980 *
FROM [20100211_except].t1
ORDER BY
val1, val2, val3, val4, val5
COMMIT
GO
The second table is 20 records short of being the full copy of the first table. There are NULL values and duplicates in both tables.
The EXCEPT
query returns this:
SELECT *
FROM [20100211_except].t1 AS t1
EXCEPT
SELECT *
FROM [20100211_except].t2 AS t2
val1 |
val2 |
val3 |
val4 |
val5 |
9999 |
9999 |
9999 |
9999 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9999 |
9999 |
9999 |
9999 |
9999 |
4 rows fetched in 0.0011s (0.1933s) |
View query details
Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 376 ms, elapsed time = 190 ms.
|--Parallelism(Gather Streams)
|--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
| |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
|--Hash Match(Aggregate, HASH:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t1].[val1] as [t1].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t1].[val2] as [t1].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t1].[val3] as [t1].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t1].[val4] as [t1].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t1].[val5] as [t1].[val5]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
|--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
There are only 4 records in the resultset, since the duplicates are eliminated. And you cannot delete from this resultset.
Here's the EXISTS
workaround:
SELECT *
FROM [20100211_except].t1 AS t1
WHERE NOT EXISTS
(
SELECT NULL
FROM [20100211_except].t2 AS t2
WHERE 1 = 1
AND (t1.val1 = t2.val1 OR (t1.val1 IS NULL AND t2.val1 IS NULL))
AND (t1.val2 = t2.val2 OR (t1.val2 IS NULL AND t2.val2 IS NULL))
AND (t1.val3 = t2.val3 OR (t1.val3 IS NULL AND t2.val3 IS NULL))
AND (t1.val4 = t2.val4 OR (t1.val4 IS NULL AND t2.val4 IS NULL))
AND (t1.val5 = t2.val5 OR (t1.val5 IS NULL AND t2.val5 IS NULL))
)
val1 |
val2 |
val3 |
val4 |
val5 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
20 rows fetched in 0.0029s (0.1953s) |
View query details
Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 344 ms, elapsed time = 194 ms.
|--Parallelism(Gather Streams)
|--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
| |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
|--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
This returns all records, but you just look at that query. And it's only 5 columns, not 70.
To emulate EXCEPT ALL
, we can use a little trick.
Unlike PostgreSQL, SQL Server does not support record types. You cannot return a whole record in one field. But it is possible to return a whole record in a tableless SELECT
in the correlated subquery. And plain EXCEPT
operator can be applied to this SELECT
.
Using this trick, EXCEPT
may be applied to each individual record from the first table, not to the table as a whole. If this record is presented among the other table's records, nothing will be returned by the operator; otherwise a single row which was initially supplied will be returned. This can be filtered using EXISTS
clause.
Here's how it looks:
SELECT *
FROM [20100211_except].t1 AS t1
WHERE EXISTS
(
SELECT t1.*
EXCEPT
SELECT *
FROM [20100211_except].t2 AS t2
)
val1 |
val2 |
val3 |
val4 |
val5 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
|
20 rows fetched in 0.0033s (0.7656s) |
View query details
Table 'Worktable'. Scan count 100000, logical reads 555796, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't2'. Scan count 1, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 984 ms, elapsed time = 758 ms.
|--Parallelism(Gather Streams)
|--Nested Loops(Left Semi Join, OUTER REFERENCES:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
|--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
|--Nested Loops(Left Anti Semi Join)
|--Constant Scan
|--Index Spool(SEEK:([t2].[val1]=[test].[20100211_except].[t1].[val1] as [t1].[val1] AND [t2].[val2]=[test].[20100211_except].[t1].[val2] as [t1].[val2] AND [t2].[val3]=[test].[20100211_except].[t1].[val3] as [t1].[val3] AND [t2].[val4]=[test].[20100211_except].[t1].[val4] as [t1].[val4] AND [t2].[val5]=[test].[20100211_except].[t1].[val5] as [t1].[val5]))
|--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
All 20 records are on their places.
Note that this query didn't use any actual field names at all: only pure set-based operators.
This is a little bit less efficient than a normal field-based EXISTS
, because this query is forced to use Nested Loops
. However, with an index that is built in-place by the Eager Spool
operation, this is still quite fast.
This query behaves just like EXCEPT ALL
but tecnhically it's a single-table SELECT
query without joins or aggregations. And as such, it can be used in the DML
.
Use this query to delete all records absent in the second table:
DELETE t1
FROM [20100211_except].t1 AS t1
WHERE EXISTS
(
SELECT t1.*
EXCEPT
SELECT *
FROM [20100211_except].t2
)
, and this one to delete records present in the second table:
DELETE t1
FROM [20100211_except].t1 AS t1
WHERE NOT EXISTS
(
SELECT t1.*
EXCEPT
SELECT *
FROM [20100211_except].t2
)
Beware: EXISTS
predicate returns absent records, and NOT EXISTS
returns present records.
This is not very intuitive, but may become more clear if you realize that the subquery effectively checks for absence of the current row, so the EXISTS
predicate in fact says where exists the absence of the current row
.
Hope that helps.
Update of Feb 12th, 2010
In a feedback, Gonçalo Ferreira suggested using INTERSECT
instead of EXCEPT
in the inner subquery:
Wouldn't using INTERSECT
in the sub-query avoid the EXISTS
- absent / NOT EXISTS
- present
confusion?
Let's try it:
SELECT *
FROM [20100211_except].t1 AS t1
WHERE NOT EXISTS
(
SELECT t1.*
INTERSECT
SELECT *
FROM [20100211_except].t2 AS t2
)
val1 |
val2 |
val3 |
val4 |
val5 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9999 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
9998 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
9999 |
9999 |
9999 |
9999 |
|
20 rows fetched in 0.0027s (0.1754s) |
View query details
Table 't2'. Scan count 3, logical reads 532, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Scan count 3, logical reads 559, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 343 ms, elapsed time = 174 ms.
|--Parallelism(Gather Streams)
|--Hash Match(Right Anti Semi Join, HASH:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5])=([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]), RESIDUAL:([test].[20100211_except].[t1].[val1] as [t1].[val1] = [test].[20100211_except].[t2].[val1] as [t2].[val1] AND [test].[20100211_except].[t1].[val2] as [t1].[val2] = [test].[20100211_except].[t2].[val2] as [t2].[val2] AND [test].[20100211_except].[t1].[val3] as [t1].[val3] = [test].[20100211_except].[t2].[val3] as [t2].[val3] AND [test].[20100211_except].[t1].[val4] as [t1].[val4] = [test].[20100211_except].[t2].[val4] as [t2].[val4] AND [test].[20100211_except].[t1].[val5] as [t1].[val5] = [test].[20100211_except].[t2].[val5] as [t2].[val5]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t2].[val1], [t2].[val2], [t2].[val3], [t2].[val4], [t2].[val5]))
| |--Table Scan(OBJECT:([test].[20100211_except].[t2] AS [t2]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([t1].[val1], [t1].[val2], [t1].[val3], [t1].[val4], [t1].[val5]))
|--Table Scan(OBJECT:([test].[20100211_except].[t1] AS [t1]))
Not only this is more readable, but this is also much faster, since it uses a single Hash Match (Right Anti Semi Join)
(which is the same thing as the equality EXISTS
clause used).
Nice point, Gonçalo!
I'm always glad to answer the questions regarding database queries.
Ask me a question
This isn’t the “except all” semantics. Except all requires if table 1 has 5 “1”s and table 2 has 3 “1”s, table 1 except all table 2 returns 2 rows. Your solution returns no row.
Jian
18 Nov 15 at 09:35
@Jian: thanks, you’re right. Will update the post.
Quassnoi
18 Nov 15 at 15:28