EXPLAIN EXTENDED

How to create fast database queries

Archive for September, 2009

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL

with 4 comments

Which method is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

Let's see for PostgreSQL 8.4.

To do this we will create two sample tables:

Read the rest of this entry »

Written by Quassnoi

September 16th, 2009 at 11:00 pm

Posted in PostgreSQL

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server

with 26 comments

This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.

Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
(
SELECT  value
FROM    t_right r
)

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_right r
WHERE   r.value = l.value
)

Differences between the methods

These methods are quite different.

First of all, LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These method differ in how they handle NULL values in t_right

LEFT JOIN is guaranteed to return every row from t_left, and then filtering is applied to the values returned from t_right. If for some row in t_left there is no corresponding row in t_right (which means no row with that exact value is present in t_right), the row from t_left will be returned once, and the NULL values will be substituted instead of t_right's actual values.

Since NULL values can never satisfy an equality JOIN condition, the NULL values returned by the query are guaranteed to be substituted by the LEFT JOIN, not fetched out of the actual t_right's row. This means that LEFT JOIN / IS NULL is guaranteed to return at most one row from t_left, and these row's value is not equal to one of those in t_right.

The same holds for NOT EXISTS. Since it's a predicate, not a JOIN condition, the rows from t_left can only be returned at most once too. EXISTS always returns TRUE or FALSE and it will return TRUE as soon as it finds only a single matching row in t_right, or FALSE, if it find none.

NOT EXISTS, therefore, will return TRUE only if no row satisfying the equality condition is found in t_right (same as for LEFT JOIN / IS NULL).

Note that NULL values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL and NOT EXISTS will always return rows from t_left that have value set to NULL, even is there are rows with value IS NULL in t_right.

NOT IN, however, behaves differently.

IN predicate (unlike EXISTS) is trivalent, i. e. it can return TRUE, FALSE or NULL:

  • TRUE is returned when the non-NULL value in question is found in the list
  • FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
  • NULL is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.

This of course makes no difference when using the positive form of NULL: predicates returning NULL are filtered out by the WHERE clause as well as those returning FALSE.

However, NOT IN is different, since negation of NULL is NULL as well.

That's why NOT IN condition will never hold for any list with a NULL value in it.

  • If a row is found in the list, IN will return TRUE and NOT IN, therefore, will return FALSE
  • If a row is not found in the list, IN will return NULL, and NOT IN on its turn will also return NULL

Both conditions will of course be filtered out by the WHERE clause.

Let's illustrate it with two simple queries that compare (1, NULL) in t_left with (2, NULL) in t_right:

WITH    t_left AS
(
SELECT  1 AS value
UNION ALL
SELECT  NULL
),
t_right AS
(
SELECT  2 AS value
UNION ALL
SELECT  NULL
)
SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_right r
WHERE   r.value = l.value
)

value
1
NULL
2 rows fetched in 0.0001s (0.0006s)

This query, using NOT EXISTS, returns both values from t_left, since neither of them is equal to any of the values from t_right.

WITH    t_left AS
(
SELECT  1 AS value
UNION ALL
SELECT  NULL
),
t_right AS
(
SELECT  2 AS value
UNION ALL
SELECT  NULL
)
SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
(
SELECT  value
FROM    t_right
)

value
0 rows fetched in 0.0001s (0.0005s)

This query, on the other hand, returns nothing. Since there is a NULL in t_right, NOT IN returns NULL rather than TRUE if the value is not found among the defined values. Just in case.

IN (and NOT IN) are too chicken to say something definite about lists with NULL unless they are completely sure that the value is there.

However, if the values in both tables are non-nullable, NULL, all three method describe above are semantically identical.

Read the rest of this entry »

Written by Quassnoi

September 15th, 2009 at 11:00 pm

Posted in SQL Server

MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Magerio asks:

Hello, thanks for your great post about advanced row sampling, it helped me a lot.

I have a question.

I am using your SQL query to show a portfolio of products and these products have a certain integer quality level (the lower, the better).

What if I want to show the best 3 products of each category?

Just a quick reminder: the article mentioned describes a query to emulate ROW_NUMBER() in MySQL. Since MySQL does not support analytic (window) functions natively, we had to do some kind of a trick to mimic their behaviour.

The original article described a query to obtain 15 first id's for each grouper. The idea behind this query was quite simple:

  1. Take a list of groupers using DISTINCT
  2. For each grouper, find the 15th id using LIMIT in a correlated subquery. We have now a set of (grouper, id15) pairs: for any given grouper we have the id of the 15th row (ordered by id).
  3. Join this resultset with the main table so that for each grouper, the id's less or equal to the 15th id are selected. We then will have 15 lowest id's for each grouper that is exactly what we were after.

This emulated ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id): a function with a single PARTITION BY expression and a single ORDER BY expression.

But in your case there are multiple ORDER BY expressions. It may be not so obvious (since the question mentions only one ordering criterion, namely, quality). However, the solution above relied on the fact that the field in ORDER BY was unique, so we now that there are exactly 15 rows with id less or equal to the id returned with LIMIT 15. Since this is most probably not the case with quality, we should include id into the ORDER BY clause so that it makes the ordering conidition unique.

A subquery in MySQL, however, can only return a single value.

To work around this, we should use two subqueries that will return us a single column in an ORDER BY condition.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

September 14th, 2009 at 11:00 pm

Posted in MySQL

SQL Server 2005: finding foreign keys that have no index

Comments enabled. I *really* need your comment

From Stack Overflow:

How can I programmatically determine which foreign key are missing indexes?

Concise. Let's try to answer it.

A foreign key involves two tables: a referencing one and a referenced one.

In most databases, some kind of a unique key is required on the referenced table (this may be a PRIMARY KEY or just a UNIQUE constraint). This is by design (when we delete a UNIQUE value, we always delete last instance of this value).

However, indexing the FOREIGN KEY fields in the referencing table is not required. Database can function even without this index. However, this can make deletes from the referenced table slow (to validate the relationship we should perform a full table lookup on the referenced table). It almost always makes sense to index the FOREIGN KEY fields.

For an index on the FOREIGN KEY fields to be of use, its leading key part should consist of all columns that comprise the FOREIGN KEY, in any order. Since FOREIGN KEY checks are always equijoins, not range joins, the order of columns does not matter as long as they are leading in the key.

Consider a constraint: FOREIGN KEY (col1, col2) REFERENCES othertable. Which indexes are good for this foreign key?

Index Is good?
col1 No. Though it's better than nothing, only the part of the foreign key can be used in the index range.
col1, col2 Yes. It's a classical case
col2, col1 Yes. The order does not matter, since we are searching for exact values of both col1 and col2
col2, col1, col3 Yes. Despite an extra column, this index can be used to filter on col1 and col2
col3, col1, col2 No. Though both col1 and col2 are indexed, they are not in the leftmost part of the index

So for each foreign key, we should check the following condition: there exists at least one index whose leading part consists of (and only of) all columns that constitute the key.

In SQL Server 2005, the information about indexed columns is stored in sys.index_column. Since indexes can be composite, this is a one-to-many relationship, where index_id and column_id define the index and the column indexed with it, while index_column_id defines column's ordinal position in the index.

For an index to satisfy the foreign key, two conditions should be fulfilled:

  1. The number of columns in the index that are also a part of the foreign key should be equal to the maximal index_column_id. This ensures that all columns constitute the leftmost part of the index.
  2. The number of columns in the index that are also a part of the foreign key should be equal to the number of columns in the foreign key. This ensures that all foreign key columns are indexed.

And here's the query:

SELECT  fk.name, s.name, o.name,
FROM    sys.foreign_keys fk
JOIN    sys.objects o
ON      o.object_id = fk.parent_object_id
JOIN    sys.schemas s
ON      s.schema_id = o.schema_id
WHERE   EXISTS
(
SELECT  *
FROM    sys.index_columns ic
WHERE   EXISTS
(
SELECT  NULL
FROM    sys.foreign_key_columns fkc
WHERE   fkc.constraint_object_id = fk.object_id
AND fkc.parent_object_id = ic.object_id
AND fkc.parent_column_id = ic.column_id
)
GROUP BY
index_id
HAVING  COUNT(*) = MAX(index_column_id)
AND COUNT(*) =
(
SELECT  COUNT(*)
FROM    sys.foreign_key_columns fkc
WHERE   fkc.constraint_object_id = fk.object_id
)
)

Now, let's create some sample tables and check this query:

Read the rest of this entry »

Written by Quassnoi

September 10th, 2009 at 11:00 pm

Posted in SQL Server

Oracle: generating a list of dates and counting ranges for each date

with 9 comments

From Stack Overflow:

I have a table with data such as below:

Group Start Date End Date
A 2001.01.01 2001.01.03
A 2001.01.01 2001.01.02
A 2001.01.03 2001.01.03
B 2001.01.01 2001.01.01

I am looking to produce a view that gives a count for each day

We have a list of ranges here, and for each date we should count the number of ranges that contain this date.

To make this query, we will employ one simple fact: the number of ranges containing a given date is the number of ranges started on or before this date minus the number of ranges that ended before this date.

This can easily be calculated using window functions.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

September 9th, 2009 at 11:00 pm

Posted in Oracle

MySQL: matching whole sets

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a database that stores products available on the market and products still in development in two separate tables (market_product and dev_product).

A third table (substance) contains all substances a product can made of.

Other two tables (marked_product_comp and dev_product_comp) mantains product compositions.

I want to select products still in development that are made of the same ingredients of marketed products.

This means that we need to return every dev_product that has at least one market_product with complete substance match.

In the following example:

DevProductID Substance
1 Apple
1 Banana
2 Apple
2 Banana
2 Cabbage
MarketProductID Substance
1 Apple
2 Apple
2 Banana
3 Apple
3 Banana
3 Cabbage
3 Daikon

we should return development product 1, since it's made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.

We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.

In one of the previous articles I wrote about a similar task in SQL Server:

However, this used FULL OUTER JOIN which MySQL lacks.

Let's see how to implement this task in MySQL:

Read the rest of this entry »

Written by Quassnoi

September 8th, 2009 at 11:00 pm

Posted in MySQL