EXPLAIN EXTENDED

How to create fast database queries

SQL Server: joining NULL values

with one comment

From Stack Overflow:

UPDATE  n
SET    AddressID = a.AddressID
FROM   #NewAddress n
JOIN   dbo.Address a
ON     (a.[LineA] = n.[LineA] OR (a.[LineA] is null AND n.[LineA] is null))
AND (a.[LineB] = n.[LineB] OR (a.[LineB] is null AND n.[LineB] is null))
AND (a.[LineC] = n.[LineC] OR (a.[LineC] is null AND n.[LineC] is null))
WHERE  n.Processed = 0

Any ideas on how to UNION or EXISTS this query? This seems to be a pretty common join condition I'm encountering.

Here's a JOIN condition: we should join two tables on a common key, also treating NULL values as equal. Normal equality conditions don't work here: in SQL, NULL values are not considered equal (strictly speaking, their equality is a boolean NULL rather than true or false).

ANSI standard defines a special predicate for that: IS NOT DISTINCT FROM, which is supported by PostgreSQL. This treats values the same value as DISTINCT does: two equal values, as well as two NULL values, are considered not distinct. MySQL also supports similar operator, <=>. Unfortunately, SQL Server supports neither.

In SQL Server, we can emulate this by using set operators which distinctify the queries. Set operators work on sets, so we would need some kind of a SELECT query for them. This would also help us to overcome SQL Server's inability to compare tuples directly. Basically, we need to tell if two tuples: (a.lineA, a.lineB, a.lineC) and (n.lineA, n.lineB, n.lineC) are distinct or not.

We can do that by selecting those tuples in dummy queries (those without a FROM clause), intersecting those queries (which would give us a one-record or zero-record resultset, depending on whether the two tuples are distinct or not) and applying EXISTS to the resultset. EXISTS is a boolean predicate and as such can serve as a join condition.

Let's check it. We'll create two tables, fill them with random values (including occasional NULLs here and there):

CREATE TABLE
a
(
id INT NOT NULL PRIMARY KEY IDENTITY,
lineA VARCHAR(MAX),
lineB VARCHAR(MAX),
lineC VARCHAR(MAX)
)

CREATE TABLE
n
(
id INT NOT NULL PRIMARY KEY IDENTITY,
lineA VARCHAR(MAX),
lineB VARCHAR(MAX),
lineC VARCHAR(MAX)
)

DECLARE @count INT = 0

SELECT RAND(0.20130508)

WHILE @count &lt;= 10000
BEGIN
        INSERT
        INTO    a (lineA, lineB, lineC)
        VALUES  (
                'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END
)

INSERT
INTO    n (lineA, lineB, lineC)
VALUES  (
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END,
'string' + CASE WHEN RAND() &gt; 0.03 THEN CAST(FLOOR(RAND() * 200) AS VARCHAR) END
)

SET @count = @count + 1
END

and join them:

SELECT  *
FROM    a
JOIN    n
ON      EXISTS
(
SELECT  a.lineA, a.lineB, a.lineC
INTERSECT
SELECT  n.lineA, n.lineB, n.lineC
)

id lineA lineB lineC id lineA lineB lineC
8038 string84 4494 string84
2726 string49 string51 1262 string49 string51
14 string73 string97 6239 string73 string97
7086 string101 string110 string99 7616 string101 string110 string99
976 string103 string129 string14 7176 string103 string129 string14
1030 string111 string145 string105 4897 string111 string145 string105
1010 string112 string142 string195 1670 string112 string142 string195
3525 string130 string117 string7 5747 string130 string117 string7
8515 string134 string2 string182 1250 string134 string2 string182
293 string17 string118 string16 9588 string17 string118 string16
3339 string18 string76 string83 6817 string18 string76 string83
239 string181 string18 string88 497 string181 string18 string88
4343 string196 string72 5396 string196 string72
4947 string23 string155 7858 string23 string155
6833 string23 string182 225 string23 string182
6453 string54 string191 string166 1199 string54 string191 string166
8539 string57 string184 string134 6607 string57 string184 string134
6452 string6 string67 9161 string6 string67
1068 string68 string146 string109 5068 string68 string146 string109
19 rows fetched in 0.0008s (0.5156s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 19, logical reads 115, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'n'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'a'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 516 ms,  elapsed time = 520 ms. 
  |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([ee].[dbo].[a].[lineA], [ee].[dbo].[a].[lineB], [ee].[dbo].[a].[lineC])=([ee].[dbo].[n].[lineA], [ee].[dbo].[n].[lineB], [ee].[dbo].[n].[lineC]), RESIDUAL:([ee].[dbo].[a].[lineA] = [ee].[dbo].[n].[lineA] AND [ee].[dbo].[a].[lineB] = [ee].[dbo].[n].[lineB] AND [ee].[dbo].[a].[lineC] = [ee].[dbo].[n].[lineC]))
       |--Sort(ORDER BY:([ee].[dbo].[a].[lineA] ASC, [ee].[dbo].[a].[lineB] ASC, [ee].[dbo].[a].[lineC] ASC))
       |    |--Clustered Index Scan(OBJECT:([ee].[dbo].[a].[PK__a__3213E83F76CD3E06]))
       |--Sort(ORDER BY:([ee].[dbo].[n].[lineA] ASC, [ee].[dbo].[n].[lineB] ASC, [ee].[dbo].[n].[lineC] ASC))
            |--Clustered Index Scan(OBJECT:([ee].[dbo].[n].[PK__n__3213E83F7A9DCEEA]))

Note that the query uses Merge Join which is only possible on equality conditions.

Indeed, we see this in the plan:

([ee].[dbo].[a].[lineA], [ee].[dbo].[a].[lineB], [ee].[dbo].[a].[lineC])=([ee].[dbo].[n].[lineA], [ee].[dbo].[n].[lineB], [ee].[dbo].[n].[lineC])

which is good old tuple comparison. As it turns out, SQL Server is capable of doing this behind the scenes, it's just something it does not implement in the front end.

Note that if we put original join clause here, it will yield exactly same plan:

SELECT  *
FROM    a
JOIN    n
ON      (a.lineA = n.lineA OR (a.lineA IS NULL AND n.lineA IS NULL))
AND (a.lineB = n.lineB OR (a.lineB IS NULL AND n.lineB IS NULL))
AND (a.lineC = n.lineC OR (a.lineC IS NULL AND n.lineC IS NULL))

id lineA lineB lineC id lineA lineB lineC
8038 string84 4494 string84
2726 string49 string51 1262 string49 string51
14 string73 string97 6239 string73 string97
7086 string101 string110 string99 7616 string101 string110 string99
976 string103 string129 string14 7176 string103 string129 string14
1030 string111 string145 string105 4897 string111 string145 string105
1010 string112 string142 string195 1670 string112 string142 string195
3525 string130 string117 string7 5747 string130 string117 string7
8515 string134 string2 string182 1250 string134 string2 string182
293 string17 string118 string16 9588 string17 string118 string16
3339 string18 string76 string83 6817 string18 string76 string83
239 string181 string18 string88 497 string181 string18 string88
4343 string196 string72 5396 string196 string72
4947 string23 string155 7858 string23 string155
6833 string23 string182 225 string23 string182
6453 string54 string191 string166 1199 string54 string191 string166
8539 string57 string184 string134 6607 string57 string184 string134
6452 string6 string67 9161 string6 string67
1068 string68 string146 string109 5068 string68 string146 string109
19 rows fetched in 0.0009s (0.5156s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 19, logical reads 115, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'n'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'a'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 500 ms,  elapsed time = 523 ms. 
  |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([ee].[dbo].[a].[lineA], [ee].[dbo].[a].[lineB], [ee].[dbo].[a].[lineC])=([ee].[dbo].[n].[lineA], [ee].[dbo].[n].[lineB], [ee].[dbo].[n].[lineC]), RESIDUAL:([ee].[dbo].[a].[lineA] = [ee].[dbo].[n].[lineA] AND [ee].[dbo].[a].[lineB] = [ee].[dbo].[n].[lineB] AND [ee].[dbo].[a].[lineC] = [ee].[dbo].[n].[lineC]))
       |--Sort(ORDER BY:([ee].[dbo].[a].[lineA] ASC, [ee].[dbo].[a].[lineB] ASC, [ee].[dbo].[a].[lineC] ASC))
       |    |--Clustered Index Scan(OBJECT:([ee].[dbo].[a].[PK__a__3213E83F76CD3E06]))
       |--Sort(ORDER BY:([ee].[dbo].[n].[lineA] ASC, [ee].[dbo].[n].[lineB] ASC, [ee].[dbo].[n].[lineC] ASC))
            |--Clustered Index Scan(OBJECT:([ee].[dbo].[n].[PK__n__3213E83F7A9DCEEA]))

, so this trick is a pure syntactic sugar. However, it's a good replacement for IS DISTINCT FROM / IS NOT DISTINCT FROM and will make your queries look prettier.

Written by Quassnoi

May 8th, 2013 at 11:00 pm

Posted in SQL Server

One Response to 'SQL Server: joining NULL values'

Subscribe to comments with RSS

  1. Nice trick, excellent article. Thank you very much.

    Minh

    16 May 13 at 04:03

Leave a Reply