EXPLAIN EXTENDED

How to create fast database queries

Author Archive

Indexing VARCHAR(MAX)

with 6 comments

SQL Server 2005 introduced new datatype: VARCHAR(MAX).

This is a replacement of old TEXT that can do anything that TEXT could, and something that TEXT could not.

Something includes possibility of being queried and compared like a plain VARCHAR. A really nice feature.

SELECT  *
FROM    [20090501_max].t_bigdata
WHERE   value LIKE 'lorem ipsum dolor sit%'

Unfortunately, VARCHAR(MAX) is still far from a plain VARCHAR. For instance, it cannot serve as an index key. It can be a part of an index (the column is included to the index leaves to avoid table lookups when using the index), but not the key of an index (the index leaves are not sorted on this column).

This means that the queries using LIKE predicate on this column will be slow.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

May 1st, 2009 at 11:00 pm

Posted in SQL Server

Rating movies

Comments enabled. I *really* need your comment

From Stack Overflow:

I have the following query in SQL Server:

SELECT  TOP 50 CustomerID FROM Ratings
WHERE   CustomerID != 915
        AND MovieID IN
        (
        SELECT DISTINCT MovieID
        FROM Ratings
        WHERE CustomerID = 915
        )
GROUP BY
        CustomerID
ORDER BY
        COUNT(*) DESC

It is super fast, but when I try to use it in a subquery like this:

SELECT  *
FROM    Ratings
WHERE   MovieID = 1
        AND CustomerID IN
        (
        SELECT  TOP 50 CustomerID FROM Ratings
        WHERE   CustomerID != 915
                AND MovieID IN
                (
                SELECT DISTINCT MovieID
                FROM Ratings
                WHERE CustomerID = 915
                )
        )
GROUP BY
        CustomerID
ORDER BY
        COUNT(*) DESC

, it's getting slow.

Any ideas on why this is so slow and how I can speed it up?

My PRIMARY KEY is (MovieID, CustomerID) and I added a index on CustomerID

This query is in fact a simple but efficient movie rating assistant.

When a user searches for a movie, it shows him ratings other users gave to this movie, but only from those people who mostly watches and rates the same movies as the user in question.

This is quite a fair rating system, as if a user is into Attack of the Killer Tomatoes! and Surf Nazis Must Die, his taste correlates with taste of people who watches and rates the same stuff better than with one of those who watches Titanic

Now, returning to the query. To build an advice on movie 1 for user 915, we need to do the following:

  1. Build a list of movies that the user 915 rated
  2. Build the list of users who have watched movie 1 and have rated the same movies as user 915 had
  3. Select TOP 50 users who had rated most movies of those user 915 had also rated
  4. Show their opinions on movie 1

If we look into this plan, we will see two search conditions. On step 1, we need to select movie's for a given user, while on step 2 we need to select user's for a given movie.

Since we don't need any other fields on steps 1 and 2, this query will benefit from creating two composite indexes of the same set of columns: MovieID and CustomerID, in different order.

Since we already have a PRIMARY KEY on (MovieID, CustomerID), we'll need to add just one more index on (CustomerID, MovieID).

And since these pairs are unique, we will also make this index UNIQUE. (My article Making an index UNIQUE describes why you should always declare UNIQUE index on intrinsically unique data.)

Having two UNIQUE indexes on the same set of columns may seem a redundancy. And in fact, it is.

But let's remember that any index is a redundancy. An index does not keep any data not initially contained in the table, it just arranges it so that it can be accessed faster.

So if two indexes will make our query even more fast, then hail redundancy!

Let's create the sample table:
Read the rest of this entry »

Written by Quassnoi

April 30th, 2009 at 11:00 pm

Posted in SQL Server

Selecting N records for each group: PostgreSQL

Comments enabled. I *really* need your comment

In one of the previous articles:

Advanced row sampling

, I described how to select a certain number of records for each group in a MySQL table.

This is trivial in SQL Server and Oracle, since an analytic function ROW_NUMBER() can be used to do this.

Now, I'll describe how to do it in PostgreSQL.

We are assuming that the records from the table should be grouped by a field called grouper, and for each grouper, N first records should be selected, ordered first by ts, then by id.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

April 29th, 2009 at 11:00 pm

Posted in PostgreSQL

Descending indexes

with 2 comments

From Stack Overflow:

When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending.

I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way?

What difference does it make which order I choose?

This matters when the composite indexes are used.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

April 27th, 2009 at 11:00 pm

Posted in SQL Server

Keeping latest rows for a group

with one comment

From Stack Overflow:

Assume I have a table foo where I have something like this:

id, user_id, timestamp, some_value

What I want to do is remove all rows that aren't the newest N per user.

The deletion itself could be handled by a query like this:

DELETE
FROM    foo
WHERE   id NOT IN (...)

so you could rephrase the problem into this: how do I get the newest N (there might be less) rows for each user.

This means if I have U users I may end up with N×U rows, so LIMIT won't really work.

Unfortunately, NOT IN will not work here. MySQL doesn't allow using the target table in an IN or NOT IN clause in a DELETE or UPDATE statement.

But some time ago I posted two solutions:

Now it's time to use these solutions together.

What we need to do here is:

  1. Select DISTINCT user_id's from the table
  2. In a subquery, SELECT n'th timestamp and n'th id for each user_id. We need to select both these fields to handle the possible duplicates of timestamp's. Unfortunately, MySQL doesn't allow multiple columns in a subquery used in a SELECT clause, that's why we need two subselects (ugly, I know)
  3. Join the resultset with original table, for each user_id selecting all rows with (timestamp, id) lesser than the (n'th timestamp, n'th id) selected above
  4. DELETE from the joined table. Unlike IN and NOT IN conditions, using the target table in a JOIN is OK for MySQL

Now, let's create the sample table and see how it works:

Table creation details

This table has 10,000 records for 5 users.

Now, let's try to delete all records except the latest 3 for each user_id:

DELETE  l.*
FROM    foo l
JOIN    (
SELECT  user_id,
COALESCE(
(
SELECT  timestamp
FROM    foo li
WHERE   li.user_id = dlo.user_id
ORDER BY
li.user_id DESC, li.timestamp DESC
LIMIT 2, 1
), CAST('0001-01-01' AS DATETIME)) AS mts,
COALESCE(
(
SELECT  id
FROM    foo li
WHERE   li.user_id = dlo.user_id
ORDER BY
li.user_id DESC, li.timestamp DESC, li.id DESC
LIMIT 2, 1
), -1) AS mid
FROM    (
SELECT  DISTINCT user_id
FROM    foo dl
) dlo
) lo
ON      l.user_id = lo.user_id
AND (l.timestamp, l.id) < (mts, mid)

We assume here that there are no records dated earlier than 1 AD, Jan 1 and no negative id's

Let's see what's left after the query finishes:

SELECT  *
FROM    foo
ORDER BY
user_id, timestamp, id
id user_id timestamp some_value
5950 1 2009-04-25 23:56:41 Value 5950
935 1 2009-04-25 23:58:52 Value 935
95 1 2009-04-25 23:59:07 Value 95
8461 2 2009-04-25 23:57:18 Value 8461
4631 2 2009-04-25 23:58:14 Value 4631
6626 2 2009-04-25 23:59:10 Value 6626
7487 3 2009-04-25 23:55:41 Value 7487
147 3 2009-04-25 23:56:08 Value 147
27 3 2009-04-25 23:59:49 Value 27
2698 4 2009-04-25 23:59:17 Value 2698
1868 4 2009-04-25 23:59:40 Value 1868
7938 4 2009-04-25 23:59:56 Value 7938
8929 5 2009-04-25 23:57:59 Value 8929
1934 5 2009-04-25 23:58:19 Value 1934
6179 5 2009-04-25 23:58:22 Value 6179
15 rows fetched in 0.0005s (0.0020s)

Just like requested: 15 records left, 3 latest records for each user_id.

Written by Quassnoi

April 26th, 2009 at 11:00 pm

Posted in MySQL

Selecting timestamps for a time zone

Comments enabled. I *really* need your comment

From Stack Overflow:

How can I reuse a computed column in an SQL query in MySQL?

My query is something like this:

SELECT CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
WHERE CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) < CURRENT_TIMESTAMP() [/sourcecode] As you can see, the CONVERT_TZ (…) part is repeated.

This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too.

Is there any way to avoid such duplication?

The duplication part is simple. We just need to wrap the query in a subquery or use HAVING clause, like this:

SELECT  *
FROM    (
        SELECT  CONVERT_TZ(
                IF(timestamp_start > last_update, timestamp_start, last_update), 
                'GMT', 
                user.timezone
                ) AS time_usr_tz
        FROM    schedule
        JOIN    user
        ON      schedule.user_id = user.user_id
        ) q
WHERE   time_usr_tz < CURRENT_TIMESTAMP()

, or this:

SELECT  CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update), 
        'GMT', 
        user.timezone
        ) AS time_usr_tz
FROM    schedule
JOIN    user
ON      schedule.user_id = user.user_id
HAVING  time_usr_tz < CURRENT_TIMESTAMP()

But what I'd really like to write about today is the performance of such a query.

This query looks for the list of schedules and finds the schedules the were to be started by the time the query is issued.

The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they're are due or not.

That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at 9:00 and reminded to take a pill at 13:00, but if she goes from Washington, D.C. to Honolulu, she may want all her alarm clocks to change from EST to HST. Instead of updating all the timestamps it's of course easier to update a single field in her profile.

The query uses columns from both tables inside a function in the WHERE condition, that's why no indexes can be used to filter the timestamps.

If schedule is large enough, this can be a problem for a busy service.

Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only timestamp_start column in the schedules, which is enough to illustrate the point.
Read the rest of this entry »

Written by Quassnoi

April 25th, 2009 at 11:00 pm

Posted in MySQL

Selecting lowest value

Comments enabled. I *really* need your comment

From Stack Overflow:

Say I have the following data:

Name Value
Small 10
Medium 100
Large 1000

Imagine that these represent the volumes of boxes.

I have some items that I want to put in the boxes, and I want the smallest box possible.

I need an SQL query that will:

  1. Return the row with the smallest row greater than my query parameter
  2. If there is no such row, then return the largest row

I'm on Oracle 11g, so any special Oracle goodness is OK.

There certainly is a special Oracle goodness, and I'll cover it in this article.

Let's create the sample table and fill it with 1,000,000 rows to illustrate the point:
Read the rest of this entry »

Written by Quassnoi

April 24th, 2009 at 11:00 pm

Posted in Oracle

Making an index UNIQUE

with one comment

From Stack Overflow:

In SQL Server, if I know an index will have unique values, how will it affect performance on INSERT's or SELECT's if I declare it as such?

If the optimiser knows the index is UNIQUE, how will that affect the query plan?

I understand that specifying uniqueness can serve to preserve integrity, but leaving that discussion aside for the moment, what are the perfomance consequences?

Let's see how does a UNIQUE index perform compared to a non-UNIQUE one.

Indexes mentioned hereafter are considered to be built on a same intrinsically unique dataset. I will leave consistency issues aside, focusing only on performance details.

A UNIQUE index, contrary to popular opinion, does not have any performance drawbacks on DML compared to a plain, non-UNIQUE index.

When the engine tries to insert a new row into the indexed table, being the index UNIQUE or not, it performs the following steps:
Read the rest of this entry »

Written by Quassnoi

April 23rd, 2009 at 11:00 pm

Posted in SQL Server

Counting missing rows: PostgreSQL

Comments enabled. I *really* need your comment

This is the 5th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Finally let's look how PostgreSQL copes with this predicate.

Let's create sample tables:
Read the rest of this entry »

Written by Quassnoi

April 22nd, 2009 at 11:00 pm

Posted in PostgreSQL

Counting missing rows: Oracle

Comments enabled. I *really* need your comment

This is the 4th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Let's see how Oracle processes this predicate.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

April 21st, 2009 at 11:00 pm

Posted in Oracle