SQL Server: performance of LEFT JOIN versus NOT IN at EXPLAIN EXTENDED

I have a case where I want to choose any database entry that have an invalid Country, Region, or Area ID.

By invalid, I mean an ID for a country or region or area that no longer exists in my tables.

I have four tables: Properties, Countries, Regions, Areas.

I was thinking to do it like this:
SELECT  *
FROM    Properties
WHERE   CountryID NOT IN 
        (
        SELECT  CountryID
        FROM    Countries
        )
        OR
        RegionID NOT IN
        (
        SELECT  RegionID
        FROM    Regions
        )
        OR
        AreaID NOT IN
        (
        SELECT  AreaID
        FROM    Areas
        )
Now, is my query right?

We have two options here:

Use a LEFT OUTER JOIN with the three dimension tables (Countries, Regions and Areas), filtering out non-NULL values
Use a NOT IN / NOT EXISTS approach shown above

LEFT JOIN intuitively seems to be more elegant and more efficient than subqueries. But is it really?

What happens on the LEFT JOIN?

All four tables are joined on the corresponding keys. This means that Products will be joined with Countries, then the resulting set will be joined with Regions, and what is left will be joined with Areas.

Practice shows that despite all user's efforts most fields tend to be filled properly. That's why the optimizer will see a good probabilty of all three resultsets to be large, and will, therefore, use a HASH JOIN for all three joins.

HASH JOIN means that we will not be able to skip some tables. We will need to read all three of them and do perform the whole JOIN.

Now, what's about the subqueries?

In this case, Properties will be a leading table.

For each row, three conditions will be checked, and at most one row will be selected from any of the tables.

If any of the conditions succeeds (e. g. there are no Country for a Property), then the optimizer will use a shortcut and won't even bother to check for subsequent conditions (a Region or an Area).

So, after all, the subquery approach seems to be better.

Let's create the sample tables and see if it's actually is:

Table creation details

SET NOCOUNT ON
GO
CREATE SCHEMA [20090414_missing]
CREATE TABLE Properties (PropertyID INT NOT NULL PRIMARY KEY, CountryID INT NOT NULL, RegionID INT NOT NULL, AreaID INT NOT NULL, name VARCHAR(200))
CREATE TABLE Countries (CountryID INT NOT NULL PRIMARY KEY, name VARCHAR(200))
CREATE TABLE Regions (RegionID INT NOT NULL PRIMARY KEY, name VARCHAR(200))
CREATE TABLE Areas (AreaID INT NOT NULL PRIMARY KEY, name VARCHAR(200))
GO
CREATE INDEX IX_Property_Country ON [20090414_missing].Properties (CountryID)
CREATE INDEX IX_Property_Region ON [20090414_missing].Properties (RegionID)
CREATE INDEX IX_Property_Area ON [20090414_missing].Properties (AreaID)
GO
DECLARE @cnt INT;
SET @cnt = 1;
BEGIN TRANSACTION
WHILE @cnt <= 524288
BEGIN
        INSERT
        INTO    &#91;20090414_missing&#93;.Properties
        VALUES  (@cnt, @cnt, @cnt, @cnt, 'Property ' + CAST(@cnt AS VARCHAR))
        SET @cnt = @cnt + 1
END
INSERT
INTO    &#91;20090414_missing&#93;.Countries
SELECT  CountryID, 'Country ' + CAST(CountryID AS VARCHAR)
FROM    &#91;20090414_missing&#93;.Properties
INSERT
INTO    &#91;20090414_missing&#93;.Regions
SELECT  RegionID, 'Country ' + CAST(RegionID AS VARCHAR)
FROM    &#91;20090414_missing&#93;.Properties
INSERT
INTO    &#91;20090414_missing&#93;.Areas
SELECT  AreaID, 'Country ' + CAST(AreaID AS VARCHAR)
FROM    &#91;20090414_missing&#93;.Properties
COMMIT
GO
&#91;/sourcecode&#93;
</div>

This is the worst case in terms of execution time: all <code>Properties</code> have corresponding <code>Countries</code>, <code>Regions</code> and <code>Areas</code>.

Let's try the <code>LEFT JOIN</code>'s first:


SELECT  PropertyID
FROM    [20090414_missing].Properties p
LEFT JOIN
        [20090414_missing].Countries c
ON      c.CountryID = p.CountryID
LEFT JOIN
        [20090414_missing].Regions r
ON      r.RegionID = p.RegionID
LEFT JOIN
        [20090414_missing].Areas a
ON      a.AreaID = p.AreaID
WHERE   c.CountryID IS NULL
        OR r.RegionID IS NULL
        OR a.AreaID IS NULL

PropertyID
No rows selected
00:00:57

Almost a minute.

Now the subqueries:

SELECT  PropertyID
FROM    [20090414_missing].Properties
WHERE   CountryID NOT IN 
        (
        SELECT  CountryID
        FROM    [20090414_missing].Countries
        )
        OR
        RegionID NOT IN
        (
        SELECT  RegionID
        FROM    [20090414_missing].Regions
        )
        OR
        AreaID NOT IN
        (
        SELECT  AreaID
        FROM    [20090414_missing].Areas
        )

PropertyID
No rows selected
00:00:11

11 seconds, much nicer.

And if the invalid entries will ever happen, the latter query will sufficiently skip them, making overall execution time even shorter.

For this specific query the NOT IN solution, proposed by its author, seems to be more efficient than LEFT JOIN's.

Written by Quassnoi

April 14th, 2009 at 11:00 pm

Posted in SQL Server

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

EXPLAIN EXTENDED

Missing entries

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Missing entries

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow