EXPLAIN EXTENDED

How to create fast database queries

Missing entries

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a case where I want to choose any database entry that have an invalid Country, Region, or Area ID.

By invalid, I mean an ID for a country or region or area that no longer exists in my tables.

I have four tables: Properties, Countries, Regions, Areas.

I was thinking to do it like this:

SELECT  *
FROM    Properties
WHERE   CountryID NOT IN 
        (
        SELECT  CountryID
        FROM    Countries
        )
        OR
        RegionID NOT IN
        (
        SELECT  RegionID
        FROM    Regions
        )
        OR
        AreaID NOT IN
        (
        SELECT  AreaID
        FROM    Areas
        )

Now, is my query right?

We have two options here:

  1. Use a LEFT OUTER JOIN with the three dimension tables (Countries, Regions and Areas), filtering out non-NULL values
  2. Use a NOT IN / NOT EXISTS approach shown above

LEFT JOIN intuitively seems to be more elegant and more efficient than subqueries. But is it really?

What happens on the LEFT JOIN?

All four tables are joined on the corresponding keys. This means that Products will be joined with Countries, then the resulting set will be joined with Regions, and what is left will be joined with Areas.

Practice shows that despite all user's efforts most fields tend to be filled properly. That's why the optimizer will see a good probabilty of all three resultsets to be large, and will, therefore, use a HASH JOIN for all three joins.

HASH JOIN means that we will not be able to skip some tables. We will need to read all three of them and do perform the whole JOIN.

Now, what's about the subqueries?

In this case, Properties will be a leading table.

For each row, three conditions will be checked, and at most one row will be selected from any of the tables.

If any of the conditions succeeds (e. g. there are no Country for a Property), then the optimizer will use a shortcut and won't even bother to check for subsequent conditions (a Region or an Area).

So, after all, the subquery approach seems to be better.

Let's create the sample tables and see if it's actually is:

Table creation details

Written by Quassnoi

April 14th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply