From Stack Overflow:
I have a case where I want to choose any database entry that have an invalid Country
, Region
, or Area
ID
.
By invalid, I mean an ID
for a country or region or area that no longer exists in my tables.
I have four tables: Properties
, Countries
, Regions
, Areas
.
I was thinking to do it like this:
SELECT *
FROM Properties
WHERE CountryID NOT IN
(
SELECT CountryID
FROM Countries
)
OR
RegionID NOT IN
(
SELECT RegionID
FROM Regions
)
OR
AreaID NOT IN
(
SELECT AreaID
FROM Areas
)
Now, is my query right?
We have two options here:
- Use a
LEFT OUTER JOIN
with the three dimension tables (Countries
, Regions
and Areas
), filtering out non-NULL
values
- Use a
NOT IN
/ NOT EXISTS
approach shown above
LEFT JOIN
intuitively seems to be more elegant and more efficient than subqueries. But is it really?
What happens on the LEFT JOIN
?
All four tables are joined on the corresponding keys. This means that Products
will be joined with Countries
, then the resulting set will be joined with Regions
, and what is left will be joined with Areas
.
Practice shows that despite all user's efforts most fields tend to be filled properly. That's why the optimizer will see a good probabilty of all three resultsets to be large, and will, therefore, use a HASH JOIN
for all three joins.
HASH JOIN
means that we will not be able to skip some tables. We will need to read all three of them and do perform the whole JOIN
.
Now, what's about the subqueries?
In this case, Properties
will be a leading table.
For each row, three conditions will be checked, and at most one row will be selected from any of the tables.
If any of the conditions succeeds (e. g. there are no Country
for a Property
), then the optimizer will use a shortcut and won't even bother to check for subsequent conditions (a Region
or an Area
).
So, after all, the subquery approach seems to be better.
Let's create the sample tables and see if it's actually is:
Table creation details
SET NOCOUNT ON
GO
CREATE SCHEMA [20090414_missing]
CREATE TABLE Properties (PropertyID INT NOT NULL PRIMARY KEY, CountryID INT NOT NULL, RegionID INT NOT NULL, AreaID INT NOT NULL, name VARCHAR(200))
CREATE TABLE Countries (CountryID INT NOT NULL PRIMARY KEY, name VARCHAR(200))
CREATE TABLE Regions (RegionID INT NOT NULL PRIMARY KEY, name VARCHAR(200))
CREATE TABLE Areas (AreaID INT NOT NULL PRIMARY KEY, name VARCHAR(200))
GO
CREATE INDEX IX_Property_Country ON [20090414_missing].Properties (CountryID)
CREATE INDEX IX_Property_Region ON [20090414_missing].Properties (RegionID)
CREATE INDEX IX_Property_Area ON [20090414_missing].Properties (AreaID)
GO
DECLARE @cnt INT;
SET @cnt = 1;
BEGIN TRANSACTION
WHILE @cnt <= 524288
BEGIN
INSERT
INTO [20090414_missing].Properties
VALUES (@cnt, @cnt, @cnt, @cnt, 'Property ' + CAST(@cnt AS VARCHAR))
SET @cnt = @cnt + 1
END
INSERT
INTO [20090414_missing].Countries
SELECT CountryID, 'Country ' + CAST(CountryID AS VARCHAR)
FROM [20090414_missing].Properties
INSERT
INTO [20090414_missing].Regions
SELECT RegionID, 'Country ' + CAST(RegionID AS VARCHAR)
FROM [20090414_missing].Properties
INSERT
INTO [20090414_missing].Areas
SELECT AreaID, 'Country ' + CAST(AreaID AS VARCHAR)
FROM [20090414_missing].Properties
COMMIT
GO
[/sourcecode]
</div>
This is the worst case in terms of execution time: all <code>Properties</code> have corresponding <code>Countries</code>, <code>Regions</code> and <code>Areas</code>.
Let's try the <code>LEFT JOIN</code>'s first:
SELECT PropertyID
FROM [20090414_missing].Properties p
LEFT JOIN
[20090414_missing].Countries c
ON c.CountryID = p.CountryID
LEFT JOIN
[20090414_missing].Regions r
ON r.RegionID = p.RegionID
LEFT JOIN
[20090414_missing].Areas a
ON a.AreaID = p.AreaID
WHERE c.CountryID IS NULL
OR r.RegionID IS NULL
OR a.AreaID IS NULL
PropertyID |
No rows selected |
00:00:57 |
Almost a minute.
Now the subqueries:
SELECT PropertyID
FROM [20090414_missing].Properties
WHERE CountryID NOT IN
(
SELECT CountryID
FROM [20090414_missing].Countries
)
OR
RegionID NOT IN
(
SELECT RegionID
FROM [20090414_missing].Regions
)
OR
AreaID NOT IN
(
SELECT AreaID
FROM [20090414_missing].Areas
)
PropertyID |
No rows selected |
00:00:11 |
11 seconds, much nicer.
And if the invalid entries will ever happen, the latter query will sufficiently skip them, making overall execution time even shorter.
For this specific query the NOT IN
solution, proposed by its author, seems to be more efficient than LEFT JOIN
's.
Leave a Reply