EXPLAIN EXTENDED

How to create fast database queries

Searching for NULL columns

Comments enabled. I *really* need your comment

From Stack Overflow:

I want to make sure I'm not inserting a duplicate row into my table (i. e. only PRIMARY KEY different).

All my fields allow NULL's as I've decided NULL to mean all values.

Because of NULL's, the following statement in my stored procedure can't work:

IF EXISTS
        (
        SELECT  *
        FROM    MY_TABLE
        WHERE   MY_FIELD1 = @IN_MY_FIELD1
                AND MY_FIELD2 = @IN_MY_FIELD2
                AND MY_FIELD3 = @IN_MY_FIELD3 
                AND MY_FIELD4 = @IN_MY_FIELD4
                AND MY_FIELD5 = @IN_MY_FIELD5
                AND MY_FIELD6 = @IN_MY_FIELD6
        )
BEGIN
        goto on_duplicate
END

, since NULL = NULL is not true.

How can I check for the duplicates without having an IF ISNULL statement for every column?

First of all, can't help being a smartass and tell that a UNIQUE index should be defined over these columns.

SQL Server does index NULL values and they are first class values as long as uniqueness is concerned, so creating this index will manage the duplicates automatically.

Now, back to business. Of course there are cases when creating such an index is not acceptable.

In these cases a very nice but little known operator INTERSECT comes handy.

This operator takes two resultsets with the same layout and finds rows common to them (eliminatnig duplicates), as SQL Server's understands common: each value in each column should be either equal to one in the other row or they both should be NULL's.

Let's create a sample table:

CREATE SCHEMA [20090704_intersect]
CREATE TABLE t_fieldset (
        id INT NOT NULL PRIMARY KEY,
        field1 INT,
        field2 INT,
        field3 INT,
        value VARCHAR(100) NOT NULL
)
GO
BEGIN TRANSACTION
SELECT  RAND(20090704)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        INSERT
        INTO    &#91;20090704_intersect&#93;.t_fieldset
        VALUES  (
                @cnt,
                NULLIF(FLOOR(RAND() * 100), 0),
                NULLIF(FLOOR(RAND() * 100), 0),
                NULLIF(FLOOR(RAND() * 100), 0),
                LEFT('Value ' + CAST(@cnt AS VARCHAR) + REPLICATE('-', 100), 100)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
&#91;/sourcecode&#93;

A <code>1,000,000</code> rows and three unindexed fields here, filled with random values.

First, let's issue an ugly query using <code>IFNULL</code>'s:


DECLARE @field1 INT
DECLARE @field2 INT
DECLARE @field3 INT
SET     @field1 = 42
SET     @field2 = NULL
SET     @field3 = NULL
SELECT  *
FROM    [20090704_intersect].t_fieldset
WHERE   (field1 = @field1 OR (field1 IS NULL AND @field1 IS NULL))
        AND (field2 = @field2 OR (field2 IS NULL AND @field2 IS NULL))
        AND (field3 = @field3 OR (field3 IS NULL AND @field3 IS NULL))

id field1 field2 field3 value
13961 42 Value 13961-----------------------------------------------------------------------------------------
1 row fetched in 0.0003s (0.7318s)
Table 't_fieldset'. Scan count 3, logical reads 19673, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 656 ms,  elapsed time = 728 ms. 

It runs for more than 0.7 seconds with the following plan:

SELECT
  Clustered Index Scan ([20090704_intersect].[t_fieldset])

Now, let's use INTERSECT.

It requires a query, so we'll wrap the parameters into a SELECT statement:

DECLARE @field1 INT
DECLARE @field2 INT
DECLARE @field3 INT
SET     @field1 = 42
SET     @field2 = NULL
SET     @field3 = NULL
SELECT  *
FROM    [20090704_intersect].t_fieldset
WHERE   EXISTS
        (
        SELECT  field1, field2, field3
        INTERSECT
        SELECT  @field1, @field2, @field3
        )

id field1 field2 field3 value
13961 42 Value 13961-----------------------------------------------------------------------------------------
1 row fetched in 0.0003s (0.3195s)
Table 't_fieldset'. Scan count 3, logical reads 19673, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 635 ms,  elapsed time = 314 ms. 

Same result, but the execution plan is slightly different:

SELECT
  Parallelism (Gather Streams)
    Clustered Index Scan ([20090704_intersect].[t_fieldset])

We see that INTERSECT, unlike IFNULL's, enables parallel execution, and makes this query twice as efficient on my dual-core machine.

Written by Quassnoi

July 4th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply