EXPLAIN EXTENDED

How to create fast database queries

Selecting timestamps for a time zone

Comments enabled. I *really* need your comment

From Stack Overflow:

How can I reuse a computed column in an SQL query in MySQL?

My query is something like this:

SELECT  CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update), 
        'GMT', 
        user.timezone
        ) AS time_usr_tz
FROM    schedule
JOIN    user
ON      schedule.user_id = user.user_id
WHERE   CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update), 
        'GMT', 
        user.timezone
        ) < CURRENT_TIMESTAMP()
&#91;/sourcecode&#93;

As you can see, the <code>CONVERT_TZ (…)</code> part is repeated.

This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too.

Is there any way to avoid such duplication?</blockquote>

The duplication part is simple. We just need to wrap the query in a subquery or use <code>HAVING</code> clause, like this:

<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  CONVERT_TZ(
                IF(timestamp_start > last_update, timestamp_start, last_update), 
                'GMT', 
                user.timezone
                ) AS time_usr_tz
        FROM    schedule
        JOIN    user
        ON      schedule.user_id = user.user_id
        ) q
WHERE   time_usr_tz < CURRENT_TIMESTAMP()
</pre>

, or this:

<pre class="brush: sql">
SELECT  CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update), 
        'GMT', 
        user.timezone
        ) AS time_usr_tz
FROM    schedule
JOIN    user
ON      schedule.user_id = user.user_id
HAVING  time_usr_tz < CURRENT_TIMESTAMP()
</pre>

But what I'd really like to write about today is the performance of such a query.

This query looks for the list of schedules and finds the schedules the were to be started by the time the query is issued.

The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they're are due or not.

That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at <strong>9:00</strong> and reminded to take a pill at <strong>13:00</strong>, but if she goes from <strong>Washington, D.C.</strong> to <strong>Honolulu</strong>, she may want all her alarm clocks to change from <strong>EST</strong> to <strong>HST</strong>. Instead of updating all the timestamps it's of course easier to update a single field in her profile.

The query uses columns from both tables inside a function in the <code>WHERE</code> condition, that's why no indexes can be used to filter the timestamps.

If <code>schedule</code> is large enough, this can be a problem for a busy service.

Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only <code>timestamp_start</code> column in the <code>schedules</code>, which is enough to illustrate the point.
<span id="more-1097"></span>
<a href="#" onclick="xcollapse('X5080');return false;"><strong>Table creation details</strong></a>
<br />
<div id="X5080" style="display: none; ">

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE user (
        user_id INT NOT NULL PRIMARY KEY,
        timezone VARCHAR(10) NOT NULL,
        name VARCHAR(50) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE schedule (
        id INT NOT NULL PRIMARY KEY,
        user_id INT NOT NULL,
        timestamp_start DATETIME NOT NULL,
        name VARCHAR(50) NOT NULL,
        KEY ix_schedule_user_timestamp (user_id, timestamp_start),
        KEY ix_schedule_timestamp (timestamp_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    user (user_id, timezone, name)
SELECT  id, CONCAT(IF(tz >= 0, '+', ''), tz, ':00'),
        CONCAT('User ', id)
FROM    (
        SELECT id, FLOOR(RAND(20090425) * 26) - 12 AS tz
        FROM    filler
        ) q
LIMIT 10000;

INSERT
INTO    schedule (id, user_id, timestamp_start, name)
SELECT  id, id % 10000 + 1,
        '2009-04-25' - INTERVAL RAND(20090425 << 1) * 50000 HOUR,
        CONCAT('Schedule ', id)
FROM    filler;
&#91;/sourcecode&#93;
</div>

The tables above contain <strong>1,000,000</strong> rows in <code>schedule</code> and <strong>10,000</strong> rows in <code>user</code>. Timezones are uniformly distributed over all possible values from <strong>UTC -12:00</strong> (<a href="http://en.wikipedia.org/wiki/Baker_Island">Baker Island</a>) to <strong>UTC +14:00</strong> (<a href="http://en.wikipedia.org/wiki/Line_Islands">Line Islands</a>).

The original query:


SELECT  *
FROM    (
        SELECT  s.id, s.user_id,
                CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz
        FROM    schedule s
        JOIN    user u
        ON      s.user_id = u.user_id
        ) q
WHERE   time_usr_tz >= '2009-04-25 10:00:00'

id user_id time_usr_tz
94569 4570 2009-04-25 11:00:00
123654 3655 2009-04-25 12:00:00
198401 8402 2009-04-25 10:00:00
478347 8348 2009-04-25 11:00:00
488454 8455 2009-04-25 10:00:00
502076 2077 2009-04-25 10:00:00
688301 8302 2009-04-25 11:00:00
822783 2784 2009-04-25 12:00:00
960232 233 2009-04-25 11:00:00
9 rows fetched in 0.0005s (4.5625s)

runs for almost 5 seconds, and uses FULL TABLE SCAN on schedule:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 1000000 100.00 Using where
2 DERIVED s ALL 1003088 100.00
2 DERIVED u eq_ref PRIMARY PRIMARY 4 20090425_timestamp.s.user_id 1 100.00

How can we deal with it?

First, we can change the expression a little. We don't need both columns to be in the formula. Instead, we can change the expression so that only timestamp_start will stay on the left side of the equation and the index on (timestamp_start, user_id) can be applied:

SELECT  s.id, s.user_id,
        CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz
FROM    user u
STRAIGHT_JOIN    schedule s
ON      s.user_id <= u.user_id
        AND s.user_id >= u.user_id
        AND timestamp_start >= CONVERT_TZ('2009-04-25 10:00:00', timezone, '+00:00')

id user_id time_usr_tz
960232 233 2009-04-25 11:00:00
502076 2077 2009-04-25 10:00:00
822783 2784 2009-04-25 12:00:00
123654 3655 2009-04-25 12:00:00
94569 4570 2009-04-25 11:00:00
688301 8302 2009-04-25 11:00:00
478347 8348 2009-04-25 11:00:00
198401 8402 2009-04-25 10:00:00
488454 8455 2009-04-25 10:00:00
9 rows fetched in 0.0003s (0.7344s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE u ALL PRIMARY 10225 100.00
1 SIMPLE s ALL ix_schedule_user_timestamp,ix_schedule_timestamp 998125 100.00 Range checked for each record (index map: 0x6)

This query uses users as a leading table and an index scan on schedule on for each record from users.

Note that we used s.user_id <= u.user_id AND s.user_id >= u.user_id instead of mere s.user_id = u.user_id: this is required to force MySQL to use RANGE CHECKING FOR EACH RECORD.

This query is much faster than before, but this approach required a full scan on users anyway, and 10,000 rows is still many rows, and 734 milliseconds is way too long.

Can we improve this query yet a little more?

Yes, we can.

If we look into the original query, we will see that for each record from schedule and corresponding record from user we need to perform some calculation on timestamp_start and timezone, and compare the resulting value to the constant. The fact of calculation being performed hinders the index from being used.

But what is this calculation, in fact? It just adds or subtracts some hours from timestamp_start, according to the timezone in the user's profile. This calculation cannot return anything less than -12:00 or more than +14:00 from the value stored in timestamp_start, since there are no places on Earth where time differs from the UTC more than these two extreme cases.

Since we need all timestamps that, beign converted to user's time zone, are less than a given current date, we can safely add a coarse filtering condition to our query that filters out all timestamp_start's earlier than 2009-04-25 10:00:00 -12:00, since in no case our expression can satisfy something earlier than that.

We can use this coarse filter to make the index usable, and then fine-filter the results just like before:

SELECT  *
FROM    (
        SELECT  s.id, s.user_id,
                CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz
        FROM    schedule s
        JOIN    user u
        ON      u.user_id = s.user_id
        WHERE   s.timestamp_start >= CONVERT_TZ('2009-04-25 10:00:00', '+00:00', '-12:00')
        ) q
WHERE   time_usr_tz >= '2009-04-25 10:00:00'

id user_id time_usr_tz
198401 8402 2009-04-25 10:00:00
688301 8302 2009-04-25 11:00:00
94569 4570 2009-04-25 11:00:00
123654 3655 2009-04-25 12:00:00
478347 8348 2009-04-25 11:00:00
488454 8455 2009-04-25 10:00:00
502076 2077 2009-04-25 10:00:00
822783 2784 2009-04-25 12:00:00
960232 233 2009-04-25 11:00:00
9 rows fetched in 0.0003s (0.0031s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 59 100.00 Using where
2 DERIVED s range ix_schedule_user_timestamp,ix_schedule_timestamp ix_schedule_timestamp 8 59 100.00 Using where
2 DERIVED u eq_ref PRIMARY PRIMARY 4 20090425_timestamp.s.user_id 1 100.00

By adding the seemingly redundant expression (that in fact serves as an efficient coarse filter), we make the index usable and the overall query time drops from 4,562 ms to 31 ms.

More than 100 times improvement in speed.

Written by Quassnoi

April 25th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply