EXPLAIN EXTENDED

How to create fast database queries

Archive for 2009

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server

with 26 comments

This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.

Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
(
SELECT  value
FROM    t_right r
)

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_right r
WHERE   r.value = l.value
)

Differences between the methods

These methods are quite different.

First of all, LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These method differ in how they handle NULL values in t_right

LEFT JOIN is guaranteed to return every row from t_left, and then filtering is applied to the values returned from t_right. If for some row in t_left there is no corresponding row in t_right (which means no row with that exact value is present in t_right), the row from t_left will be returned once, and the NULL values will be substituted instead of t_right's actual values.

Since NULL values can never satisfy an equality JOIN condition, the NULL values returned by the query are guaranteed to be substituted by the LEFT JOIN, not fetched out of the actual t_right's row. This means that LEFT JOIN / IS NULL is guaranteed to return at most one row from t_left, and these row's value is not equal to one of those in t_right.

The same holds for NOT EXISTS. Since it's a predicate, not a JOIN condition, the rows from t_left can only be returned at most once too. EXISTS always returns TRUE or FALSE and it will return TRUE as soon as it finds only a single matching row in t_right, or FALSE, if it find none.

NOT EXISTS, therefore, will return TRUE only if no row satisfying the equality condition is found in t_right (same as for LEFT JOIN / IS NULL).

Note that NULL values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL and NOT EXISTS will always return rows from t_left that have value set to NULL, even is there are rows with value IS NULL in t_right.

NOT IN, however, behaves differently.

IN predicate (unlike EXISTS) is trivalent, i. e. it can return TRUE, FALSE or NULL:

  • TRUE is returned when the non-NULL value in question is found in the list
  • FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
  • NULL is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.

This of course makes no difference when using the positive form of NULL: predicates returning NULL are filtered out by the WHERE clause as well as those returning FALSE.

However, NOT IN is different, since negation of NULL is NULL as well.

That's why NOT IN condition will never hold for any list with a NULL value in it.

  • If a row is found in the list, IN will return TRUE and NOT IN, therefore, will return FALSE
  • If a row is not found in the list, IN will return NULL, and NOT IN on its turn will also return NULL

Both conditions will of course be filtered out by the WHERE clause.

Let's illustrate it with two simple queries that compare (1, NULL) in t_left with (2, NULL) in t_right:

WITH    t_left AS
(
SELECT  1 AS value
UNION ALL
SELECT  NULL
),
t_right AS
(
SELECT  2 AS value
UNION ALL
SELECT  NULL
)
SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_right r
WHERE   r.value = l.value
)

value
1
NULL
2 rows fetched in 0.0001s (0.0006s)

This query, using NOT EXISTS, returns both values from t_left, since neither of them is equal to any of the values from t_right.

WITH    t_left AS
(
SELECT  1 AS value
UNION ALL
SELECT  NULL
),
t_right AS
(
SELECT  2 AS value
UNION ALL
SELECT  NULL
)
SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
(
SELECT  value
FROM    t_right
)

value
0 rows fetched in 0.0001s (0.0005s)

This query, on the other hand, returns nothing. Since there is a NULL in t_right, NOT IN returns NULL rather than TRUE if the value is not found among the defined values. Just in case.

IN (and NOT IN) are too chicken to say something definite about lists with NULL unless they are completely sure that the value is there.

However, if the values in both tables are non-nullable, NULL, all three method describe above are semantically identical.

Read the rest of this entry »

Written by Quassnoi

September 15th, 2009 at 11:00 pm

Posted in SQL Server

MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Magerio asks:

Hello, thanks for your great post about advanced row sampling, it helped me a lot.

I have a question.

I am using your SQL query to show a portfolio of products and these products have a certain integer quality level (the lower, the better).

What if I want to show the best 3 products of each category?

Just a quick reminder: the article mentioned describes a query to emulate ROW_NUMBER() in MySQL. Since MySQL does not support analytic (window) functions natively, we had to do some kind of a trick to mimic their behaviour.

The original article described a query to obtain 15 first id's for each grouper. The idea behind this query was quite simple:

  1. Take a list of groupers using DISTINCT
  2. For each grouper, find the 15th id using LIMIT in a correlated subquery. We have now a set of (grouper, id15) pairs: for any given grouper we have the id of the 15th row (ordered by id).
  3. Join this resultset with the main table so that for each grouper, the id's less or equal to the 15th id are selected. We then will have 15 lowest id's for each grouper that is exactly what we were after.

This emulated ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id): a function with a single PARTITION BY expression and a single ORDER BY expression.

But in your case there are multiple ORDER BY expressions. It may be not so obvious (since the question mentions only one ordering criterion, namely, quality). However, the solution above relied on the fact that the field in ORDER BY was unique, so we now that there are exactly 15 rows with id less or equal to the id returned with LIMIT 15. Since this is most probably not the case with quality, we should include id into the ORDER BY clause so that it makes the ordering conidition unique.

A subquery in MySQL, however, can only return a single value.

To work around this, we should use two subqueries that will return us a single column in an ORDER BY condition.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

September 14th, 2009 at 11:00 pm

Posted in MySQL

SQL Server 2005: finding foreign keys that have no index

Comments enabled. I *really* need your comment

From Stack Overflow:

How can I programmatically determine which foreign key are missing indexes?

Concise. Let's try to answer it.

A foreign key involves two tables: a referencing one and a referenced one.

In most databases, some kind of a unique key is required on the referenced table (this may be a PRIMARY KEY or just a UNIQUE constraint). This is by design (when we delete a UNIQUE value, we always delete last instance of this value).

However, indexing the FOREIGN KEY fields in the referencing table is not required. Database can function even without this index. However, this can make deletes from the referenced table slow (to validate the relationship we should perform a full table lookup on the referenced table). It almost always makes sense to index the FOREIGN KEY fields.

For an index on the FOREIGN KEY fields to be of use, its leading key part should consist of all columns that comprise the FOREIGN KEY, in any order. Since FOREIGN KEY checks are always equijoins, not range joins, the order of columns does not matter as long as they are leading in the key.

Consider a constraint: FOREIGN KEY (col1, col2) REFERENCES othertable. Which indexes are good for this foreign key?

Index Is good?
col1 No. Though it's better than nothing, only the part of the foreign key can be used in the index range.
col1, col2 Yes. It's a classical case
col2, col1 Yes. The order does not matter, since we are searching for exact values of both col1 and col2
col2, col1, col3 Yes. Despite an extra column, this index can be used to filter on col1 and col2
col3, col1, col2 No. Though both col1 and col2 are indexed, they are not in the leftmost part of the index

So for each foreign key, we should check the following condition: there exists at least one index whose leading part consists of (and only of) all columns that constitute the key.

In SQL Server 2005, the information about indexed columns is stored in sys.index_column. Since indexes can be composite, this is a one-to-many relationship, where index_id and column_id define the index and the column indexed with it, while index_column_id defines column's ordinal position in the index.

For an index to satisfy the foreign key, two conditions should be fulfilled:

  1. The number of columns in the index that are also a part of the foreign key should be equal to the maximal index_column_id. This ensures that all columns constitute the leftmost part of the index.
  2. The number of columns in the index that are also a part of the foreign key should be equal to the number of columns in the foreign key. This ensures that all foreign key columns are indexed.

And here's the query:

SELECT  fk.name, s.name, o.name,
FROM    sys.foreign_keys fk
JOIN    sys.objects o
ON      o.object_id = fk.parent_object_id
JOIN    sys.schemas s
ON      s.schema_id = o.schema_id
WHERE   EXISTS
(
SELECT  *
FROM    sys.index_columns ic
WHERE   EXISTS
(
SELECT  NULL
FROM    sys.foreign_key_columns fkc
WHERE   fkc.constraint_object_id = fk.object_id
AND fkc.parent_object_id = ic.object_id
AND fkc.parent_column_id = ic.column_id
)
GROUP BY
index_id
HAVING  COUNT(*) = MAX(index_column_id)
AND COUNT(*) =
(
SELECT  COUNT(*)
FROM    sys.foreign_key_columns fkc
WHERE   fkc.constraint_object_id = fk.object_id
)
)

Now, let's create some sample tables and check this query:

Read the rest of this entry »

Written by Quassnoi

September 10th, 2009 at 11:00 pm

Posted in SQL Server

Oracle: generating a list of dates and counting ranges for each date

with 10 comments

From Stack Overflow:

I have a table with data such as below:

Group Start Date End Date
A 2001.01.01 2001.01.03
A 2001.01.01 2001.01.02
A 2001.01.03 2001.01.03
B 2001.01.01 2001.01.01

I am looking to produce a view that gives a count for each day

We have a list of ranges here, and for each date we should count the number of ranges that contain this date.

To make this query, we will employ one simple fact: the number of ranges containing a given date is the number of ranges started on or before this date minus the number of ranges that ended before this date.

This can easily be calculated using window functions.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

September 9th, 2009 at 11:00 pm

Posted in Oracle

MySQL: matching whole sets

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a database that stores products available on the market and products still in development in two separate tables (market_product and dev_product).

A third table (substance) contains all substances a product can made of.

Other two tables (marked_product_comp and dev_product_comp) mantains product compositions.

I want to select products still in development that are made of the same ingredients of marketed products.

This means that we need to return every dev_product that has at least one market_product with complete substance match.

In the following example:

DevProductID Substance
1 Apple
1 Banana
2 Apple
2 Banana
2 Cabbage
MarketProductID Substance
1 Apple
2 Apple
2 Banana
3 Apple
3 Banana
3 Cabbage
3 Daikon

we should return development product 1, since it's made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.

We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.

In one of the previous articles I wrote about a similar task in SQL Server:

However, this used FULL OUTER JOIN which MySQL lacks.

Let's see how to implement this task in MySQL:

Read the rest of this entry »

Written by Quassnoi

September 8th, 2009 at 11:00 pm

Posted in MySQL

PostgreSQL: selecting items that belong to all categories

Comments enabled. I *really* need your comment

From Stack Overflow:

I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:

Project Category
Foo Apple
Foo Banana
Foo Carrot
Bar Apple
Bar Carrot
Qux Apple
Qux Banana

What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.

For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.

If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.

A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

August 25th, 2009 at 11:00 pm

Posted in PostgreSQL

SQL Server: leading wildcard match using an index

Comments enabled. I *really* need your comment

From Stack Overflow:

I'm joining to a table dozens of different times, and every time, I join (or filter) based on the results of a SUBSTRING of one of the columns (it's a string, but left-padded with zeros, and I don't care about the last four digits).

As a result, even though this column is indexed and my query would use the index, it does a table scan because the SUBSTRING itself isn't indexed, so SQL Server has to calculate it for every row before as it's joining.

I'm looking for any ideas on how to speed up this process.

This problem is called leading wildcard match. Though it's possible to implement it, it usually requires indexing all possible combinations of substrings which usually bloats the index.

However, in this very case it's possible to create quite an efficient index. It is possible because the number of characters we need to trim from the right is fixed.

The main idea is to create a computed column which would store trimmed and reversed value of the string, index that column and use this index in search.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

August 24th, 2009 at 11:00 pm

Posted in SQL Server

What is a relational database?

with 3 comments

My blog is about SQL, and SQL is about relational databases.

So, what's SQL and what's a relational database and what's an entity-relationship model?

These terms are often misused and this leads to confusion.

In this article, I'd like to explain what a relational database is.

For starters, a little quote from Jack London's The League of the Old Men:

An Indian healer

An Indian healer

Howkan began to read. Imber listened for a space, when a wonderment rose up in his face and he broke in abruptly.

That be my talk, Howkan. Yet from thy lips it comes when thy ears have not heard.

Howkan smirked with self-appreciation. His hair was parted in the middle. Nay, from the paper it comes, O Imber. Never have my ears heard. From the paper it comes, through my eyes, into my head, and out of my mouth to thee. Thus it comes.

Thus it comes? It be there in the paper? Imber's voice sank in whisperful awe as he crackled the sheets 'twixt thumb and finger and stared at the charactery scrawled thereon. It be a great medicine, Howkan, and thou art a worker of wonders.

It be nothing, it be nothing, the young man responded carelessly and pridefully. He read at hazard from the document: In that year, before the break of the ice, came an old man, and a boy who was lame of one foot. These also did I kill, and the old man made much noise—

It be true, Imber interrupted breathlessly. He made much noise and would not die for a long time. But how dost thou know, Howkan? The chief man of the white men told thee, mayhap? No one beheld me, and him alone have I told.

Howkan shook his head with impatience. Have I not told thee it be there in the paper, O fool?

Imber stared hard at the ink-scrawled surface. As the hunter looks upon the snow and says, Here but yesterday there passed a rabbit; and here by the willow scrub it stood and listened, and heard, and was afraid; and here it turned upon its trail; and here it went with great swiftness, leaping wide; and here, with greater swiftness and wider leapings, came a lynx; and here, where the claws cut deep into the snow, the lynx made a very great leap; and here it struck, with the rabbit under and rolling belly up; and here leads off the trail of the lynx alone, and there is no more rabbit,—as the hunter looks upon the markings of the snow and says thus and so and here, dost thou, too, look upon the paper and say thus and so and here be the things old Imber hath done?

Old Imber made a great analogy. Information is but markings on the snow. When you look at the snow, there are no rabbit or lynx there. But when an experienced hunter sees their traces he can go and find them.

However, the old Indian could only read the information while his nephew could also create it by writing the words on paper and reading them back. He had full control over the information he created.

And for information to be useful, it should be structured.

When a car dealer describes a car he really wants to sell he can spend half an hour talking of magnificent design, good extras and a decent price, but just forget to mention that, say, the warranty terms are not those you'd expect.

But when you fill a credit card application form, you answer a set of predefined questions. What is your name? Where do you live? How much do you earn? Do you have incurable cancer?

The bank needs all this information to make sure you will pay your debt (or at least they can get a decent price for your debt when it's sold to a collection agency).

And they want you to hide no important information about yourself and to represent all information in such a way that it can be easily interpreted and make sure nothing they are interested in would be forgotten to answer or left behind.

They want the data to be structured. And by making you to fill a form they enforce the structure of the information you provide.

They force you to use their data model.

Read the rest of this entry »

Written by Quassnoi

August 23rd, 2009 at 11:00 pm

Posted in Miscellaneous

SQL Server: finding date closest to a birthdate

Comments enabled. I *really* need your comment

From Stack Overflow:

The problem is with dynamically getting upcoming what we call namedays in Europe.

Name day is a tradition in many countries in Europe and Latin America of celebrating on a particular day of the year associated with the one's given name.

Here's a sampe name day table (I store the date in this format: 1900-08-22, but we really only need month and day):

First name Name day
Bob 1900-04-22
Bob 1900-09-04
Frank 1900-01-02

The trick is that there might be multiple name days for each name, and someones name day is always the first one found after ones birthday.

So if Bob was born on August 5th, his name day would fall on September 4th, but if we were born after Sep 4th, his name day would be on April 22nd.

We need to find when a given person celebrates his/her name day based on the first name and birthdate.

What I need is an SQL Server query that will be able to get me name days for the people in my database.

We actually need to make the calendar to wrap around: to build a condition that would select 1904, Jan 1st after 1904, Dec 31th.

We can do it using a COALESCE on the simple queries in OUTER APPLY clause.

Let's create sample tables:

Read the rest of this entry »

Written by Quassnoi

August 21st, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: efficient DISTINCT on dates

Comments enabled. I *really* need your comment

From Stack Overflow:

I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.

My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar already in place using DISTINCT).

The problem is whether there is any trick to get it done more quickly.

With the data we are using, our current query is returning ~80 distinct days for which there are ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query always manages to take 5+ seconds. Which is too slow.

The problem with this query is that no distinct list of truncated dates is easily accessible.

That's why SQL Server has to fetch all the dates from the table (or from the index) and distinctify them somehow.

In Oracle we could probably create a function-based index that would index the truncated part of a DATETIME column. Then it would just use jump over the distinct index keys.

Same is possible in MySQL (however, the latter does not support function-based indexes, so one has to update the date-only column manually).

SQL Server does support indexes on computed columns, however, its optimizer lacks ability to skip distinct records in an index.

That's why creating an index on a computed column doesn't help.

But since we are talking about the dates, it's possible to emulate this behavior to some extent using a recursive CTE that would generate a list of dates, and the checking each date.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

August 20th, 2009 at 11:00 pm

Posted in SQL Server