EXPLAIN EXTENDED

How to create fast database queries

Archive for August, 2009

Efficient date range query: SQL Server

with one comment

From Stack Overflow:

I have a table A that has a startDate and an endDate as 2 DATETIME columns besides some more other columns.

I have another table B that has one DATETIME column (call it dates column).

This is in SQL Server 2005.

How to best set up the indexes etc to get the following:

SELECT *
FROM a, b
WHERE a.startDate >= b.dates
AND a.endDate < b.dates [/sourcecode] Both tables have several thousand records.

Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.

A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).

However, this query is just the opposite: we search for a constaint between two unknown values.

If we create the indexes on startDate and endDate and issue a query with a condition like this:

startDate <= value AND endDate > value

, this will just give two range condition on two distinct indexes:

-INF <= startDate <= value
value < endDate <= +INF

Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.

There is a special class of indexes: R-Tree indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called SPATIAL indexes.

In both these systems they can only be defined over geometry data which implies 2-dimensional objects (Points, LineStrings and such).

There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):

However, SQL Server 2005 does not support SPATIAL indexes.

In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).

However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I'll show how to do it.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

August 11th, 2009 at 11:00 pm

Posted in SQL Server

Fallback language names: MySQL

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In Oracle, SQL Server and PostgreSQL, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

To gain efficiency, all these systems used some kind of a join method that performs better on the large portions of the rowset, i. e. HASH JOIN or MERGE JOIN.

MySQL, however, is only capable of doing nested loops, so MySQL's performance should differ from one of the engines tested earlier.

Let's create sample tables and see:

Read the rest of this entry »

Written by Quassnoi

August 10th, 2009 at 11:00 pm

Posted in MySQL

Fallback language names: PostgreSQL

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

Now, let's create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

August 8th, 2009 at 11:00 pm

Posted in PostgreSQL

Fallback language names: SQL Server

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

From Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Yesterday, we saw that the second option (two LEFT JOINs) is best in Oracle if the probability of the fallback is high (i. e. the localization is poor and but few terms are translated into the local language). I took Latin language as an example for this.

Third query (a LEFT JOIN with the local language rowset and a subquery in a COALESCE to retrive the fallback language names) is best if there is a low probability of a fallback, i. e. almost all terms are translated. In this case, I took French as an example, since it is widely used and localizations are likely to cover most terms.

Today we will see how same queries work in SQL Server.

Let's create sample tables:

Read the rest of this entry »

Written by Quassnoi

August 7th, 2009 at 11:00 pm

Posted in SQL Server

Fallback language names: Oracle

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

From Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

This is quite a common database design (most often used for localization), so I'd like to cover this question for different databases.

This design is not limited to languages and localization: it may be used for any data with default value that can be overriden. Prices, discounts, options — many kinds of things.

Today we will see how do it in Oracle.

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Let's create sample tables and see what is more efficient.
Read the rest of this entry »

Written by Quassnoi

August 6th, 2009 at 11:00 pm

Posted in Oracle

MySQL: LIMIT on LIMIT

Comments enabled. I *really* need your comment

From Stack Overflow:

Here's my situation: I want to select first ten entries from a database with id = $id.

But I want the results to be listed in a certain priority: if criteria = $criteria, then I want those results displayed first.

Otherwise, I just want to keep displaying the rest of the rows.

This can be done with a very simple query:

SELECT  *
FROM    t_preference
ORDER BY
        category = 1 DESC, id DESC
LIMIT 10

In this query, category = 1 is a boolean expression which returns TRUE when category is equal to 1, FALSE otherwise.

Since TRUE is 1 and FALSE is 0, TRUE is greater than FALSE from ORDER BY's point of view. We should apply DESC for the TRUE results (i. e. rows with category = 1) to be returned first.

However, this query is not very index friendly, and it will need to select all rows and sort them.

To improve performance, we should split this query into two index-friendly queries and merge them with a UNION.

Let's create a sample table and see how do we do it:

Read the rest of this entry »

Written by Quassnoi

August 5th, 2009 at 11:00 pm

Posted in MySQL

SQL Server: deleting adjacent values (improved)

Comments enabled. I *really* need your comment

Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:

A quick reminder of the problem, taken from Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.

In the previous article, I used a subquery to do this:

WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.day = qo.day
                                AND qi.interval < = qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    [20090803_adjacent].records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.day = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    [20090803_adjacent].records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0

This subquery traverses adjacent intervals until it find the first interval that has no predecessor.

Its id then serves as a range identifier in a PARTITION BY clause.

However, this query can be improved.

Assume we have the following intervals:

interval
3
4
5
6
10
11
12
20
21

We have three ranges here: 3:6, 10:12 and 20:21.

To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:

interval ROW_NUMBER
3 1
4 2
5 3
6 4
10 5
11 6
12 7
20 8
21 9

ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.

This means that we can just subtract ROW_NUMBER from the interval:

interval ROW_NUMBER ROW_NUMBER - interval
3 1 2
4 2 2
5 3 2
6 4 2
10 5 5
11 6 5
12 7 5
20 8 12
21 9 12

Within each range, this difference is a constant. When a gap occurs, this difference changes.

Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.

This should be more efficient since it does not use the subqueries.

Let's create a sample table and check:

Read the rest of this entry »

Written by Quassnoi

August 4th, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: deleting adjacent values

Comments enabled. I *really* need your comment

From Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:

SELECT  *
FROM    mytable qi
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable t
        WHERE   t.unitid = qi.unitid
                AND t.day = qi.day
                AND t.interval = qi.interval - 1
        )

However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.

If we just delete everything, the query above will not return any records just as well, but it's not our goal.

The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.

So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).

In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.

However, we can formulate the delete condition.

We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.

The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).

The second record of any range should be deleted, since it's adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).

The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.

In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.

The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.

It is best done by iterating the records up from the current and returning the id of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1). This can be done in a subquery right in the PARTITION BY clause of the ROW_NUMBER function.

Let's create a sample table and compose the query:

Read the rest of this entry »

Written by Quassnoi

August 3rd, 2009 at 11:00 pm

Posted in SQL Server

MySQL: DISTINCT on DISTINCT

Comments enabled. I *really* need your comment

From Stack Overflow:

I currently have the following query:

SELECT  group_concat (DISTINCT usrFirst, usrLast) as receiver_name
FROM    …

It outputs my list of names as follows:

JohnDoe,BillSmith,DaveJones

, while desired ouput would be:

John Doe, Bill Smith, and Dave Jones

Basically, I need help figuring out three things:

  1. How can I put a space, between the first and last name?
  2. How can insert a space after each comma?
  3. How can I add an and right before the last name displayed?

This kind of output should be really processed on the client (since comma and the and word are very cultire specific).

From MySQL, we just need a recordset of distinct full names.

Let's create a sample table and see how do we do it:

Read the rest of this entry »

Written by Quassnoi

August 1st, 2009 at 11:00 pm

Posted in MySQL