EXPLAIN EXTENDED

How to create fast database queries

Archive for 2009

Descending indexes

with 2 comments

From Stack Overflow:

When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending.

I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way?

What difference does it make which order I choose?

This matters when the composite indexes are used.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

April 27th, 2009 at 11:00 pm

Posted in SQL Server

Keeping latest rows for a group

with one comment

From Stack Overflow:

Assume I have a table foo where I have something like this:

id, user_id, timestamp, some_value

What I want to do is remove all rows that aren't the newest N per user.

The deletion itself could be handled by a query like this:

DELETE
FROM    foo
WHERE   id NOT IN (...)

so you could rephrase the problem into this: how do I get the newest N (there might be less) rows for each user.

This means if I have U users I may end up with N×U rows, so LIMIT won't really work.

Unfortunately, NOT IN will not work here. MySQL doesn't allow using the target table in an IN or NOT IN clause in a DELETE or UPDATE statement.

But some time ago I posted two solutions:

Now it's time to use these solutions together.

What we need to do here is:

  1. Select DISTINCT user_id's from the table
  2. In a subquery, SELECT n'th timestamp and n'th id for each user_id. We need to select both these fields to handle the possible duplicates of timestamp's. Unfortunately, MySQL doesn't allow multiple columns in a subquery used in a SELECT clause, that's why we need two subselects (ugly, I know)
  3. Join the resultset with original table, for each user_id selecting all rows with (timestamp, id) lesser than the (n'th timestamp, n'th id) selected above
  4. DELETE from the joined table. Unlike IN and NOT IN conditions, using the target table in a JOIN is OK for MySQL

Now, let's create the sample table and see how it works:

Table creation details

This table has 10,000 records for 5 users.

Now, let's try to delete all records except the latest 3 for each user_id:

DELETE  l.*
FROM    foo l
JOIN    (
SELECT  user_id,
COALESCE(
(
SELECT  timestamp
FROM    foo li
WHERE   li.user_id = dlo.user_id
ORDER BY
li.user_id DESC, li.timestamp DESC
LIMIT 2, 1
), CAST('0001-01-01' AS DATETIME)) AS mts,
COALESCE(
(
SELECT  id
FROM    foo li
WHERE   li.user_id = dlo.user_id
ORDER BY
li.user_id DESC, li.timestamp DESC, li.id DESC
LIMIT 2, 1
), -1) AS mid
FROM    (
SELECT  DISTINCT user_id
FROM    foo dl
) dlo
) lo
ON      l.user_id = lo.user_id
AND (l.timestamp, l.id) < (mts, mid)

We assume here that there are no records dated earlier than 1 AD, Jan 1 and no negative id's

Let's see what's left after the query finishes:

SELECT  *
FROM    foo
ORDER BY
user_id, timestamp, id
id user_id timestamp some_value
5950 1 2009-04-25 23:56:41 Value 5950
935 1 2009-04-25 23:58:52 Value 935
95 1 2009-04-25 23:59:07 Value 95
8461 2 2009-04-25 23:57:18 Value 8461
4631 2 2009-04-25 23:58:14 Value 4631
6626 2 2009-04-25 23:59:10 Value 6626
7487 3 2009-04-25 23:55:41 Value 7487
147 3 2009-04-25 23:56:08 Value 147
27 3 2009-04-25 23:59:49 Value 27
2698 4 2009-04-25 23:59:17 Value 2698
1868 4 2009-04-25 23:59:40 Value 1868
7938 4 2009-04-25 23:59:56 Value 7938
8929 5 2009-04-25 23:57:59 Value 8929
1934 5 2009-04-25 23:58:19 Value 1934
6179 5 2009-04-25 23:58:22 Value 6179
15 rows fetched in 0.0005s (0.0020s)

Just like requested: 15 records left, 3 latest records for each user_id.

Written by Quassnoi

April 26th, 2009 at 11:00 pm

Posted in MySQL

Selecting timestamps for a time zone

Comments enabled. I *really* need your comment

From Stack Overflow:

How can I reuse a computed column in an SQL query in MySQL?

My query is something like this:

SELECT CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
WHERE CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) < CURRENT_TIMESTAMP() [/sourcecode] As you can see, the CONVERT_TZ (…) part is repeated.

This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too.

Is there any way to avoid such duplication?

The duplication part is simple. We just need to wrap the query in a subquery or use HAVING clause, like this:

SELECT  *
FROM    (
        SELECT  CONVERT_TZ(
                IF(timestamp_start > last_update, timestamp_start, last_update), 
                'GMT', 
                user.timezone
                ) AS time_usr_tz
        FROM    schedule
        JOIN    user
        ON      schedule.user_id = user.user_id
        ) q
WHERE   time_usr_tz < CURRENT_TIMESTAMP()

, or this:

SELECT  CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update), 
        'GMT', 
        user.timezone
        ) AS time_usr_tz
FROM    schedule
JOIN    user
ON      schedule.user_id = user.user_id
HAVING  time_usr_tz < CURRENT_TIMESTAMP()

But what I'd really like to write about today is the performance of such a query.

This query looks for the list of schedules and finds the schedules the were to be started by the time the query is issued.

The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they're are due or not.

That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at 9:00 and reminded to take a pill at 13:00, but if she goes from Washington, D.C. to Honolulu, she may want all her alarm clocks to change from EST to HST. Instead of updating all the timestamps it's of course easier to update a single field in her profile.

The query uses columns from both tables inside a function in the WHERE condition, that's why no indexes can be used to filter the timestamps.

If schedule is large enough, this can be a problem for a busy service.

Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only timestamp_start column in the schedules, which is enough to illustrate the point.
Read the rest of this entry »

Written by Quassnoi

April 25th, 2009 at 11:00 pm

Posted in MySQL

Selecting lowest value

Comments enabled. I *really* need your comment

From Stack Overflow:

Say I have the following data:

Name Value
Small 10
Medium 100
Large 1000

Imagine that these represent the volumes of boxes.

I have some items that I want to put in the boxes, and I want the smallest box possible.

I need an SQL query that will:

  1. Return the row with the smallest row greater than my query parameter
  2. If there is no such row, then return the largest row

I'm on Oracle 11g, so any special Oracle goodness is OK.

There certainly is a special Oracle goodness, and I'll cover it in this article.

Let's create the sample table and fill it with 1,000,000 rows to illustrate the point:
Read the rest of this entry »

Written by Quassnoi

April 24th, 2009 at 11:00 pm

Posted in Oracle

Making an index UNIQUE

with one comment

From Stack Overflow:

In SQL Server, if I know an index will have unique values, how will it affect performance on INSERT's or SELECT's if I declare it as such?

If the optimiser knows the index is UNIQUE, how will that affect the query plan?

I understand that specifying uniqueness can serve to preserve integrity, but leaving that discussion aside for the moment, what are the perfomance consequences?

Let's see how does a UNIQUE index perform compared to a non-UNIQUE one.

Indexes mentioned hereafter are considered to be built on a same intrinsically unique dataset. I will leave consistency issues aside, focusing only on performance details.

A UNIQUE index, contrary to popular opinion, does not have any performance drawbacks on DML compared to a plain, non-UNIQUE index.

When the engine tries to insert a new row into the indexed table, being the index UNIQUE or not, it performs the following steps:
Read the rest of this entry »

Written by Quassnoi

April 23rd, 2009 at 11:00 pm

Posted in SQL Server

Counting missing rows: PostgreSQL

Comments enabled. I *really* need your comment

This is the 5th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Finally let's look how PostgreSQL copes with this predicate.

Let's create sample tables:
Read the rest of this entry »

Written by Quassnoi

April 22nd, 2009 at 11:00 pm

Posted in PostgreSQL

Counting missing rows: Oracle

Comments enabled. I *really* need your comment

This is the 4th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Let's see how Oracle processes this predicate.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

April 21st, 2009 at 11:00 pm

Posted in Oracle

Counting missing rows: SQL Server

Comments enabled. I *really* need your comment

This is the 3rd of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Now, let's see how the same predicate is implemented in SQL Server.

We'll create sample tables just like those we used yesterday:
Read the rest of this entry »

Written by Quassnoi

April 20th, 2009 at 11:00 pm

Posted in SQL Server

Counting missing rows: using DISTINCT

Comments enabled. I *really* need your comment

This is the 2nd of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Today, I will tell about some useful tricks that help to optimize these subqueries even more.

Let's create some sample tables:
Read the rest of this entry »

Written by Quassnoi

April 19th, 2009 at 11:00 pm

Posted in MySQL

Counting missing rows

Comments enabled. I *really* need your comment

From Stack Overflow:

In my first programming job I was told that a query within the IN() predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.

For example, given the query:

SELECT  COUNT(*)
FROM    Table1
WHERE   Table1Id NOT IN
	(
	SELECT	Table1Id
	FROM	Table2
	WHERE	id_user = 1
	)
Rows Inner subquery executions
10 10
100 100
1000 1000
10000 10000

Is this correct? How does the IN predicate actually work?

This is the 1st of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

In this article I will describe how MySQL implements this predicate.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

April 18th, 2009 at 11:00 pm

Posted in MySQL