Missing entries
From Stack Overflow:
I have a case where I want to choose any database entry that have an invalid
Country,Region, orAreaID.By invalid, I mean an
IDfor a country or region or area that no longer exists in my tables.I have four tables:
Properties,Countries,Regions,Areas.I was thinking to do it like this:
SELECT * FROM Properties WHERE CountryID NOT IN ( SELECT CountryID FROM Countries ) OR RegionID NOT IN ( SELECT RegionID FROM Regions ) OR AreaID NOT IN ( SELECT AreaID FROM Areas )Now, is my query right?
We have two options here:
- Use a
LEFT OUTER JOINwith the three dimension tables (Countries,RegionsandAreas), filtering out non-NULLvalues - Use a
NOT IN/NOT EXISTSapproach shown above
LEFT JOIN intuitively seems to be more elegant and more efficient than subqueries. But is it really?
What happens on the LEFT JOIN?
All four tables are joined on the corresponding keys. This means that Products will be joined with Countries, then the resulting set will be joined with Regions, and what is left will be joined with Areas.
Practice shows that despite all user’s efforts most fields tend to be filled properly. That’s why the optimizer will see a good probabilty of all three resultsets to be large, and will, therefore, use a HASH JOIN for all three joins.
HASH JOIN means that we will not be able to skip some tables. We will need to read all three of them and do perform the whole JOIN.
Now, what’s about the subqueries?
In this case, Properties will be a leading table.
For each row, three conditions will be checked, and at most one row will be selected from any of the tables.
If any of the conditions succeeds (e. g. there are no Country for a Property), then the optimizer will use a shortcut and won’t even bother to check for subsequent conditions (a Region or an Area).
So, after all, the subquery approach seems to be better.
Let’s create the sample tables and see if it’s actually is:
This is the worst case in terms of execution time: all Properties have corresponding Countries, Regions and Areas.
Let’s try the LEFT JOIN‘s first:
SELECT PropertyID
FROM [20090414_missing].Properties p
LEFT JOIN
[20090414_missing].Countries c
ON c.CountryID = p.CountryID
LEFT JOIN
[20090414_missing].Regions r
ON r.RegionID = p.RegionID
LEFT JOIN
[20090414_missing].Areas a
ON a.AreaID = p.AreaID
WHERE c.CountryID IS NULL
OR r.RegionID IS NULL
OR a.AreaID IS NULL
| PropertyID | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No rows selected | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 00:00:57 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Almost a minute.
Now the subqueries:
SELECT PropertyID
FROM [20090414_missing].Properties
WHERE CountryID NOT IN
(
SELECT CountryID
FROM [20090414_missing].Countries
)
OR
RegionID NOT IN
(
SELECT RegionID
FROM [20090414_missing].Regions
)
OR
AreaID NOT IN
(
SELECT AreaID
FROM [20090414_missing].Areas
)
| PropertyID | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No rows selected | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 00:00:11 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
11 seconds, much nicer.
And if the invalid entries will ever happen, the latter query will sufficiently skip them, making overall execution time even shorter.
For this specific query the NOT IN solution, proposed by its author, seems to be more efficient than LEFT JOIN‘s.
Subscribe in a reader