EXPLAIN EXTENDED

How to create fast database queries

Archive for May, 2010

LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns

with one comment

In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.

Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.

In SQL, there are at least three methods to implement it:

LEFT JOIN / IS NULL

SELECT  o.*
FROM    outer o
LEFT JOIN
inner i
ON      i.value = o.value
WHERE   i.value IS NULL

NOT IN

SELECT  o.*
FROM    outer o
WHERE   o.value NOT IN
(
SELECT  value
FROM    inner
)

NOT EXISTS

SELECT  o.*
FROM    outer o
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    inner i
WHERE   i.value = o.value
)

When inner.value is marked as NOT NULL, all these queries are semantically equivalent and with proper indexing have similarly optimized execution plans in MySQL.

Now, what if inner.value is not nullable and does contain some NULL values?

Let's create some sample tables:
Read the rest of this entry »

Written by Quassnoi

May 27th, 2010 at 11:00 pm

Posted in MySQL

Things SQL needs: determining range cardinality

with 15 comments

What is the problem with this query?

SELECT  *
FROM    orders
WHERE   quantity <= 4
        AND urgency <= 4


The problem is indexing strategy, of course. Which columns should we index?

If we index quantity, the optimizer will be able to use the index to filter on it. However, filtering on urgency will require scanning all records with quantity < 4 and applying the urgency filter to each record found.

Same with urgency. We can use range access on urgency using an index, but this will require filtering on quantity.

Why, create a composite index!, some will say.

Unfortunately, that won't help much.

A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on (quantity, urgency) will sort on quantity, and only if the quantities are equal, it will take the urgency into account.

The picture below shows how would the records be ordered in such an index:

As we can see, with a single index range scan (i. e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.

Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:
Read the rest of this entry »

Written by Quassnoi

May 19th, 2010 at 11:00 pm

Posted in Miscellaneous

MAX and MIN on a composite index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Ivo Radev asks:

I am trying to make a very simple query.

We have a log table which different machines write to. Given the machine list, I need to find the latest log timestamp.

Currently, the query looks like this:

SELECT  MAX(log_time)
FROM    log_table
WHERE   log_machine IN ($machines)

, and I pass the comma-separated list of $machines from PHP.

The weird thing is that the query is literally instant when there is only one machine (any) in the list but slow when there are multiple machines.

I'm considering doing it in separate queries and then process the results in PHP. However I'd like to know if there is a fast solution in MySQL.

Most probably, there is a composite index on (log_machine, log_time) which is being used for the query.

Usually, a query like this:

SELECT  MAX(log_time)
FROM    log_table

on the indexed field log_time can be served with a single index seek on the index.

Indeed, the MAX(log_time), by definition, is the latest entry in the index order, and can be fetched merely by finding the trailing index entry. It's a matter of several page reads in the B-Tree, each one following the rightmost link to the lower-level page.

Similarly, this query:

SELECT  MAX(log_time)
FROM    log_table
WHERE   log_machine = $my_machine

can be served with a single index seek too. However, the index should include log_machine as a leading column.

In this case, a set of records satisfying the WHERE clause of the query is represented by a single logically continuous block of records in the index, each one sharing the same value of log_machine. MAX(log_time) will of course be held by the last record in this block. MySQL just finds that last record and takes the log_time out of it.

Now, what if we have a multiple condition on log_machine?
Read the rest of this entry »

Written by Quassnoi

May 8th, 2010 at 11:00 pm

Posted in MySQL

Things SQL needs: MERGE JOIN that would seek

with 7 comments

One of the most known and least used join algorithms in SQL engines is MERGE JOIN.

This algorithm operates on two sorted recordsets, keeping two pointers that chase each other.

The Wikipedia entry above describes it quite well in terms of algorithms. I'll just make an animated GIF to make it more clear:

This is quite a nice and elegant algorithm, which, unfortunately, has two major drawbacks:

  1. It needs the recordsets to be sorted
  2. Even with the recordsets sorted, it is no better than a HASH JOIN

The sorting part is essential for this algorithm and there is nothing that can be done with it: the recordsets should be sorted, period. Databases, however, often provide the records in the sorted order: from clustered tables, indexes, previously sorted and ordered subqueries, spool tables etc.

But even when the recordsets are already sorted, on equijoins the MERGE JOIN is hardly faster than a HASH JOIN.

Why?
Read the rest of this entry »

Written by Quassnoi

May 7th, 2010 at 11:00 pm

Posted in Miscellaneous