Author Archive
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Finally, it's MySQL time.
As always, we will create the sample tables:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Today, we will see how Oracle copes with these queries.
And to do this, we, of course, should create sample tables:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Let's see for PostgreSQL 8.4.
To do this we will create two sample tables:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.
Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Differences between the methods
These methods are quite different.
First of all, LEFT JOIN / IS NULL
and NOT EXISTS
are semantically equivalent, while NOT IN
is not. These method differ in how they handle NULL
values in t_right
LEFT JOIN
is guaranteed to return every row from t_left
, and then filtering is applied to the values returned from t_right
. If for some row in t_left
there is no corresponding row in t_right
(which means no row with that exact value is present in t_right
), the row from t_left
will be returned once, and the NULL
values will be substituted instead of t_right
's actual values.
Since NULL
values can never satisfy an equality JOIN
condition, the NULL
values returned by the query are guaranteed to be substituted by the LEFT JOIN
, not fetched out of the actual t_right
's row. This means that LEFT JOIN / IS NULL
is guaranteed to return at most one row from t_left
, and these row's value
is not equal to one of those in t_right
.
The same holds for NOT EXISTS
. Since it's a predicate, not a JOIN
condition, the rows from t_left
can only be returned at most once too. EXISTS
always returns TRUE
or FALSE
and it will return TRUE
as soon as it finds only a single matching row in t_right
, or FALSE
, if it find none.
NOT EXISTS
, therefore, will return TRUE
only if no row satisfying the equality condition is found in t_right
(same as for LEFT JOIN / IS NULL
).
Note that NULL
values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL
and NOT EXISTS
will always return rows from t_left
that have value
set to NULL
, even is there are rows with value IS NULL
in t_right
.
NOT IN
, however, behaves differently.
IN
predicate (unlike EXISTS
) is trivalent, i. e. it can return TRUE
, FALSE
or NULL
:
TRUE
is returned when the non-NULL
value in question is found in the listFALSE
is returned when the non-NULL
value is not found in the list and the list does not containNULL
valuesNULL
is returned when the value isNULL
, or the non-NULL
value is not found in the list and the list contains at least oneNULL
value
IN
predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL
values on either side of the expression, returning NULL
instead.
This of course makes no difference when using the positive form of NULL
: predicates returning NULL
are filtered out by the WHERE
clause as well as those returning FALSE
.
However, NOT IN
is different, since negation of NULL
is NULL
as well.
That's why NOT IN
condition will never hold for any list with a NULL
value in it.
- If a row is found in the list,
IN
will returnTRUE
andNOT IN
, therefore, will returnFALSE
- If a row is not found in the list,
IN
will returnNULL
, andNOT IN
on its turn will also returnNULL
Both conditions will of course be filtered out by the WHERE
clause.
Let's illustrate it with two simple queries that compare (1, NULL)
in t_left with (2, NULL)
in t_right
:
WITH t_left AS ( SELECT 1 AS value UNION ALL SELECT NULL ), t_right AS ( SELECT 2 AS value UNION ALL SELECT NULL ) SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NULL | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 rows fetched in 0.0001s (0.0006s) |
This query, using NOT EXISTS
, returns both values from t_left
, since neither of them is equal to any of the values from t_right
.
WITH t_left AS ( SELECT 1 AS value UNION ALL SELECT NULL ), t_right AS ( SELECT 2 AS value UNION ALL SELECT NULL ) SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right )
value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows fetched in 0.0001s (0.0005s) |
This query, on the other hand, returns nothing. Since there is a NULL
in t_right
, NOT IN
returns NULL
rather than TRUE
if the value is not found among the defined values. Just in case.
IN
(and NOT IN
) are too chicken to say something definite about lists with NULL
unless they are completely sure that the value is there.
However, if the values in both tables are non-nullable, NULL
, all three method describe above are semantically identical.
MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Magerio asks:
Hello, thanks for your great post about advanced row sampling, it helped me a lot.
I have a question.
I am using your SQL query to show a portfolio of products and these products have a certain integer quality level (the lower, the better).
What if I want to show the best 3 products of each category?
Just a quick reminder: the article mentioned describes a query to emulate ROW_NUMBER()
in MySQL. Since MySQL does not support analytic (window) functions natively, we had to do some kind of a trick to mimic their behaviour.
The original article described a query to obtain 15 first id
's for each grouper
. The idea behind this query was quite simple:
- Take a list of groupers using
DISTINCT
- For each grouper, find the 15th
id
usingLIMIT
in a correlated subquery. We have now a set of(grouper, id15)
pairs: for any givengrouper
we have theid
of the 15th row (ordered byid
). - Join this resultset with the main table so that for each
grouper
, theid
's less or equal to the 15thid
are selected. We then will have 15 lowestid
's for eachgrouper
that is exactly what we were after.
This emulated ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)
: a function with a single PARTITION BY
expression and a single ORDER BY
expression.
But in your case there are multiple ORDER BY
expressions. It may be not so obvious (since the question mentions only one ordering criterion, namely, quality
). However, the solution above relied on the fact that the field in ORDER BY
was unique, so we now that there are exactly 15 rows with id
less or equal to the id
returned with LIMIT 15
. Since this is most probably not the case with quality, we should include id
into the ORDER BY
clause so that it makes the ordering conidition unique.
A subquery in MySQL, however, can only return a single value.
To work around this, we should use two subqueries that will return us a single column in an ORDER BY
condition.
Let's create a sample table:
Read the rest of this entry »
SQL Server 2005: finding foreign keys that have no index
Comments enabled. I *really* need your comment
From Stack Overflow:
How can I programmatically determine which foreign key are missing indexes?
Concise. Let's try to answer it.
A foreign key involves two tables: a referencing one and a referenced one.
In most databases, some kind of a unique key is required on the referenced table (this may be a PRIMARY KEY
or just a UNIQUE
constraint). This is by design (when we delete a UNIQUE
value, we always delete last instance of this value).
However, indexing the FOREIGN KEY
fields in the referencing table is not required. Database can function even without this index. However, this can make deletes from the referenced table slow (to validate the relationship we should perform a full table lookup on the referenced table). It almost always makes sense to index the FOREIGN KEY
fields.
For an index on the FOREIGN KEY
fields to be of use, its leading key part should consist of all columns that comprise the FOREIGN KEY
, in any order. Since FOREIGN KEY
checks are always equijoins, not range joins, the order of columns does not matter as long as they are leading in the key.
Consider a constraint: FOREIGN KEY (col1, col2) REFERENCES othertable
. Which indexes are good for this foreign key?
Index | Is good? |
---|---|
col1 |
No. Though it's better than nothing, only the part of the foreign key can be used in the index range. |
col1, col2 |
Yes. It's a classical case |
col2, col1 |
Yes. The order does not matter, since we are searching for exact values of both col1 and col2 |
col2, col1, col3 |
Yes. Despite an extra column, this index can be used to filter on col1 and col2 |
col3, col1, col2 |
No. Though both col1 and col2 are indexed, they are not in the leftmost part of the index |
So for each foreign key, we should check the following condition: there exists at least one index whose leading part consists of (and only of) all columns that constitute the key.
In SQL Server 2005, the information about indexed columns is stored in sys.index_column
. Since indexes can be composite, this is a one-to-many relationship, where index_id
and column_id
define the index and the column indexed with it, while index_column_id
defines column's ordinal position in the index.
For an index to satisfy the foreign key, two conditions should be fulfilled:
- The number of columns in the index that are also a part of the foreign key should be equal to the maximal
index_column_id
. This ensures that all columns constitute the leftmost part of the index. - The number of columns in the index that are also a part of the foreign key should be equal to the number of columns in the foreign key. This ensures that all foreign key columns are indexed.
And here's the query:
SELECT fk.name, s.name, o.name, FROM sys.foreign_keys fk JOIN sys.objects o ON o.object_id = fk.parent_object_id JOIN sys.schemas s ON s.schema_id = o.schema_id WHERE EXISTS ( SELECT * FROM sys.index_columns ic WHERE EXISTS ( SELECT NULL FROM sys.foreign_key_columns fkc WHERE fkc.constraint_object_id = fk.object_id AND fkc.parent_object_id = ic.object_id AND fkc.parent_column_id = ic.column_id ) GROUP BY index_id HAVING COUNT(*) = MAX(index_column_id) AND COUNT(*) = ( SELECT COUNT(*) FROM sys.foreign_key_columns fkc WHERE fkc.constraint_object_id = fk.object_id ) )
Now, let's create some sample tables and check this query:
Oracle: generating a list of dates and counting ranges for each date
From Stack Overflow:
I have a table with data such as below:
Group Start Date End Date A 2001.01.01 2001.01.03 A 2001.01.01 2001.01.02 A 2001.01.03 2001.01.03 B 2001.01.01 2001.01.01 I am looking to produce a view that gives a count for each day
We have a list of ranges here, and for each date we should count the number of ranges that contain this date.
To make this query, we will employ one simple fact: the number of ranges containing a given date is the number of ranges started on or before this date minus the number of ranges that ended before this date.
This can easily be calculated using window functions.
Let's create a sample table:
MySQL: matching whole sets
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a database that stores products
available on the marketand productsstill in developmentin two separate tables (market_product
anddev_product
).A third table (
substance
) contains all substances a product can made of.Other two tables (
marked_product_comp
anddev_product_comp
) mantains product compositions.I want to select products still in development that are made of the same ingredients of marketed products.
This means that we need to return every dev_product
that has at least one market_product
with complete substance match.
In the following example:
DevProductID | Substance |
---|---|
1 | Apple |
1 | Banana |
2 | Apple |
2 | Banana |
2 | Cabbage |
MarketProductID | Substance |
---|---|
1 | Apple |
2 | Apple |
2 | Banana |
3 | Apple |
3 | Banana |
3 | Cabbage |
3 | Daikon |
we should return development product 1, since it's made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.
We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.
In one of the previous articles I wrote about a similar task in SQL Server:
However, this used FULL OUTER JOIN
which MySQL lacks.
Let's see how to implement this task in MySQL:
PostgreSQL: selecting items that belong to all categories
Comments enabled. I *really* need your comment
From Stack Overflow:
I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project Category Foo Apple Foo Banana Foo Carrot Bar Apple Bar Carrot Qux Apple Qux Banana What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.
For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.
If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.
A usual approach to this problem involves GROUP BY
and COUNT
. However, this can be made more efficient.
Let's create a sample table:
SQL Server: leading wildcard match using an index
Comments enabled. I *really* need your comment
From Stack Overflow:
I'm joining to a table dozens of different times, and every time, I join (or filter) based on the results of a
SUBSTRING
of one of the columns (it's a string, but left-padded with zeros, and I don't care about the last four digits).As a result, even though this column is indexed and my query would use the index, it does a table scan because the
SUBSTRING
itself isn't indexed, so SQL Server has to calculate it for every row before as it's joining.I'm looking for any ideas on how to speed up this process.
This problem is called leading wildcard match. Though it's possible to implement it, it usually requires indexing all possible combinations of substrings which usually bloats the index.
However, in this very case it's possible to create quite an efficient index. It is possible because the number of characters we need to trim from the right is fixed.
The main idea is to create a computed column which would store trimmed and reversed value of the string, index that column and use this index in search.
Let's create a sample table: