SQL Server: joining NULL values at EXPLAIN EXTENDED

UPDATE  n
SET    AddressID = a.AddressID
FROM   #NewAddress n
JOIN   dbo.Address a
ON     (a.[LineA] = n.[LineA] OR (a.[LineA] is null AND n.[LineA] is null))
AND (a.[LineB] = n.[LineB] OR (a.[LineB] is null AND n.[LineB] is null))
AND (a.[LineC] = n.[LineC] OR (a.[LineC] is null AND n.[LineC] is null))
WHERE  n.Processed = 0

Any ideas on how to UNION or EXISTS this query? This seems to be a pretty common join condition I'm encountering.

Here's a JOIN condition: we should join two tables on a common key, also treating NULL values as equal. Normal equality conditions don't work here: in SQL, NULL values are not considered equal (strictly speaking, their equality is a boolean NULL rather than true or false).

ANSI standard defines a special predicate for that: IS NOT DISTINCT FROM, which is supported by PostgreSQL. This treats values the same value as DISTINCT does: two equal values, as well as two NULL values, are considered not distinct. MySQL also supports similar operator, <=>. Unfortunately, SQL Server supports neither.

In SQL Server, we can emulate this by using set operators which distinctify the queries. Set operators work on sets, so we would need some kind of a SELECT query for them. This would also help us to overcome SQL Server's inability to compare tuples directly. Basically, we need to tell if two tuples: (a.lineA, a.lineB, a.lineC) and (n.lineA, n.lineB, n.lineC) are distinct or not.

We can do that by selecting those tuples in dummy queries (those without a FROM clause), intersecting those queries (which would give us a one-record or zero-record resultset, depending on whether the two tuples are distinct or not) and applying EXISTS to the resultset. EXISTS is a boolean predicate and as such can serve as a join condition.

Let's check it. We'll create two tables, fill them with random values (including occasional NULLs here and there):

CREATE TABLE
a
(
id INT NOT NULL PRIMARY KEY IDENTITY,
lineA VARCHAR(MAX),
lineB VARCHAR(MAX),
lineC VARCHAR(MAX)
)

CREATE TABLE
n
(
id INT NOT NULL PRIMARY KEY IDENTITY,
lineA VARCHAR(MAX),
lineB VARCHAR(MAX),
lineC VARCHAR(MAX)
)

DECLARE @count INT = 0

SELECT RAND(0.20130508)

WHILE @count &lt;= 10000
BEGIN
        INSERT
        INTO    a (lineA, lineB, lineC)
        VALUES  (
                'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END
)

INSERT
INTO    n (lineA, lineB, lineC)
VALUES  (
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END
)

SET @count = @count + 1
END

and join them:

SELECT  *
FROM    a
JOIN    n
ON      EXISTS
(
SELECT  a.lineA, a.lineB, a.lineC
INTERSECT
SELECT  n.lineA, n.lineB, n.lineC
)

id	lineA	lineB	lineC	id	lineA	lineB	lineC
8038			string84	4494			string84
2726		string49	string51	1262		string49	string51
14		string73	string97	6239		string73	string97
7086	string101	string110	string99	7616	string101	string110	string99
976	string103	string129	string14	7176	string103	string129	string14
1030	string111	string145	string105	4897	string111	string145	string105
1010	string112	string142	string195	1670	string112	string142	string195
3525	string130	string117	string7	5747	string130	string117	string7
8515	string134	string2	string182	1250	string134	string2	string182
293	string17	string118	string16	9588	string17	string118	string16
3339	string18	string76	string83	6817	string18	string76	string83
239	string181	string18	string88	497	string181	string18	string88
4343	string196		string72	5396	string196		string72
4947	string23		string155	7858	string23		string155
6833	string23		string182	225	string23		string182
6453	string54	string191	string166	1199	string54	string191	string166
8539	string57	string184	string134	6607	string57	string184	string134
6452	string6	string67		9161	string6	string67
1068	string68	string146	string109	5068	string68	string146	string109
19 rows fetched in 0.0008s (0.5156s)

[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 19, logical reads 115, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'n'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'a'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 516 ms,  elapsed time = 520 ms.

  |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([ee].[dbo].[a].[lineA], [ee].[dbo].[a].[lineB], [ee].[dbo].[a].[lineC])=([ee].[dbo].[n].[lineA], [ee].[dbo].[n].[lineB], [ee].[dbo].[n].[lineC]), RESIDUAL:([ee].[dbo].[a].[lineA] = [ee].[dbo].[n].[lineA] AND [ee].[dbo].[a].[lineB] = [ee].[dbo].[n].[lineB] AND [ee].[dbo].[a].[lineC] = [ee].[dbo].[n].[lineC]))
       |--Sort(ORDER BY:([ee].[dbo].[a].[lineA] ASC, [ee].[dbo].[a].[lineB] ASC, [ee].[dbo].[a].[lineC] ASC))
       |    |--Clustered Index Scan(OBJECT:([ee].[dbo].[a].[PK__a__3213E83F76CD3E06]))
       |--Sort(ORDER BY:([ee].[dbo].[n].[lineA] ASC, [ee].[dbo].[n].[lineB] ASC, [ee].[dbo].[n].[lineC] ASC))
            |--Clustered Index Scan(OBJECT:([ee].[dbo].[n].[PK__n__3213E83F7A9DCEEA]))

Note that the query uses Merge Join which is only possible on equality conditions.

Indeed, we see this in the plan:

([ee].[dbo].[a].[lineA], [ee].[dbo].[a].[lineB], [ee].[dbo].[a].[lineC])=([ee].[dbo].[n].[lineA], [ee].[dbo].[n].[lineB], [ee].[dbo].[n].[lineC])

which is good old tuple comparison. As it turns out, SQL Server is capable of doing this behind the scenes, it's just something it does not implement in the front end.

Note that if we put original join clause here, it will yield exactly same plan:

SELECT  *
FROM    a
JOIN    n
ON      (a.lineA = n.lineA OR (a.lineA IS NULL AND n.lineA IS NULL))
AND (a.lineB = n.lineB OR (a.lineB IS NULL AND n.lineB IS NULL))
AND (a.lineC = n.lineC OR (a.lineC IS NULL AND n.lineC IS NULL))

id	lineA	lineB	lineC	id	lineA	lineB	lineC
8038			string84	4494			string84
2726		string49	string51	1262		string49	string51
14		string73	string97	6239		string73	string97
7086	string101	string110	string99	7616	string101	string110	string99
976	string103	string129	string14	7176	string103	string129	string14
1030	string111	string145	string105	4897	string111	string145	string105
1010	string112	string142	string195	1670	string112	string142	string195
3525	string130	string117	string7	5747	string130	string117	string7
8515	string134	string2	string182	1250	string134	string2	string182
293	string17	string118	string16	9588	string17	string118	string16
3339	string18	string76	string83	6817	string18	string76	string83
239	string181	string18	string88	497	string181	string18	string88
4343	string196		string72	5396	string196		string72
4947	string23		string155	7858	string23		string155
6833	string23		string182	225	string23		string182
6453	string54	string191	string166	1199	string54	string191	string166
8539	string57	string184	string134	6607	string57	string184	string134
6452	string6	string67		9161	string6	string67
1068	string68	string146	string109	5068	string68	string146	string109
19 rows fetched in 0.0009s (0.5156s)

[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 19, logical reads 115, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'n'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'a'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 500 ms,  elapsed time = 523 ms.

  |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([ee].[dbo].[a].[lineA], [ee].[dbo].[a].[lineB], [ee].[dbo].[a].[lineC])=([ee].[dbo].[n].[lineA], [ee].[dbo].[n].[lineB], [ee].[dbo].[n].[lineC]), RESIDUAL:([ee].[dbo].[a].[lineA] = [ee].[dbo].[n].[lineA] AND [ee].[dbo].[a].[lineB] = [ee].[dbo].[n].[lineB] AND [ee].[dbo].[a].[lineC] = [ee].[dbo].[n].[lineC]))
       |--Sort(ORDER BY:([ee].[dbo].[a].[lineA] ASC, [ee].[dbo].[a].[lineB] ASC, [ee].[dbo].[a].[lineC] ASC))
       |    |--Clustered Index Scan(OBJECT:([ee].[dbo].[a].[PK__a__3213E83F76CD3E06]))
       |--Sort(ORDER BY:([ee].[dbo].[n].[lineA] ASC, [ee].[dbo].[n].[lineB] ASC, [ee].[dbo].[n].[lineC] ASC))
            |--Clustered Index Scan(OBJECT:([ee].[dbo].[n].[PK__n__3213E83F7A9DCEEA]))

, so this trick is a pure syntactic sugar. However, it's a good replacement for IS DISTINCT FROM / IS NOT DISTINCT FROM and will make your queries look prettier.

Written by Quassnoi

May 8th, 2013 at 11:00 pm

Posted in SQL Server

One Response to 'SQL Server: joining NULL values'

Subscribe to comments with RSS

Nice trick, excellent article. Thank you very much.

Minh

16 May 13 at 04:03

EXPLAIN EXTENDED