MySQL: filtering timestamps with a timezone at EXPLAIN EXTENDED

How can I reuse a computed column in an SQL query in MySQL?

My query is something like this:
SELECT CONVERT_TZ(
 IF(timestamp_start > last_update, timestamp_start, last_update), 
 'GMT', 
 user.timezone
 ) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
WHERE CONVERT_TZ(
 IF(timestamp_start > last_update, timestamp_start, last_update), 
 'GMT', 
 user.timezone
 ) < CURRENT_TIMESTAMP()
&#91;/sourcecode&#93;

As you can see, the <code>CONVERT_TZ (…)</code> part is repeated.

This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too.

Is there any way to avoid such duplication?</blockquote>

The duplication part is simple. We just need to wrap the query in a subquery or use <code>HAVING</code> clause, like this:

<pre class="brush: sql">
SELECT *
FROM (
 SELECT CONVERT_TZ(
 IF(timestamp_start > last_update, timestamp_start, last_update), 
 'GMT', 
 user.timezone
 ) AS time_usr_tz
 FROM schedule
 JOIN user
 ON schedule.user_id = user.user_id
 ) q
WHERE time_usr_tz < CURRENT_TIMESTAMP()
</pre>

, or this:

<pre class="brush: sql">
SELECT CONVERT_TZ(
 IF(timestamp_start > last_update, timestamp_start, last_update), 
 'GMT', 
 user.timezone
 ) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
HAVING time_usr_tz < CURRENT_TIMESTAMP()
</pre>

But what I'd really like to write about today is the performance of such a query.

This query looks for the list of schedules and finds the schedules the were to be started by the time the query is issued.

The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they're are due or not.

That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at 9:00 and reminded to take a pill at 13:00, but if she goes from Washington, D.C. to Honolulu, she may want all her alarm clocks to change from EST to HST. Instead of updating all the timestamps it's of course easier to update a single field in her profile.

The query uses columns from both tables inside a function in the <code>WHERE</code> condition, that's why no indexes can be used to filter the timestamps.

If <code>schedule</code> is large enough, this can be a problem for a busy service.

Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only <code>timestamp_start</code> column in the <code>schedules</code>, which is enough to illustrate the point.

<a href="#" onclick="xcollapse('X5080');return false;">Table creation details</a>
 
<div id="X5080" style="display: none; ">

CREATE TABLE filler (
 id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE user (
 user_id INT NOT NULL PRIMARY KEY,
 timezone VARCHAR(10) NOT NULL,
 name VARCHAR(50) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE schedule (
 id INT NOT NULL PRIMARY KEY,
 user_id INT NOT NULL,
 timestamp_start DATETIME NOT NULL,
 name VARCHAR(50) NOT NULL,
 KEY ix_schedule_user_timestamp (user_id, timestamp_start),
 KEY ix_schedule_timestamp (timestamp_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
 DECLARE _cnt INT;
 SET _cnt = 1;
 WHILE _cnt <= cnt DO
 INSERT
 INTO filler
 SELECT _cnt;
 SET _cnt = _cnt + 1;
 END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO user (user_id, timezone, name)
SELECT id, CONCAT(IF(tz >= 0, '+', ''), tz, ':00'),
 CONCAT('User ', id)
FROM (
 SELECT id, FLOOR(RAND(20090425) * 26) - 12 AS tz
 FROM filler
 ) q
LIMIT 10000;

INSERT
INTO schedule (id, user_id, timestamp_start, name)
SELECT id, id % 10000 + 1,
 '2009-04-25' - INTERVAL RAND(20090425 << 1) * 50000 HOUR,
 CONCAT('Schedule ', id)
FROM filler;
&#91;/sourcecode&#93;
</div>

The tables above contain 1,000,000 rows in <code>schedule</code> and 10,000 rows in <code>user</code>. Timezones are uniformly distributed over all possible values from UTC -12:00 (<a href="http://en.wikipedia.org/wiki/Baker_Island">Baker Island</a>) to UTC +14:00 (<a href="http://en.wikipedia.org/wiki/Line_Islands">Line Islands</a>).

The original query:


SELECT *
FROM (
 SELECT s.id, s.user_id,
 CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz
 FROM schedule s
 JOIN user u
 ON s.user_id = u.user_id
 ) q
WHERE time_usr_tz >= '2009-04-25 10:00:00'
id user_id time_usr_tz

94569 4570 2009-04-25 11:00:00

123654 3655 2009-04-25 12:00:00

198401 8402 2009-04-25 10:00:00

478347 8348 2009-04-25 11:00:00

488454 8455 2009-04-25 10:00:00

502076 2077 2009-04-25 10:00:00

688301 8302 2009-04-25 11:00:00

822783 2784 2009-04-25 12:00:00

960232 233 2009-04-25 11:00:00

9 rows fetched in 0.0005s (4.5625s)

runs for almost 5 seconds, and uses FULL TABLE SCAN on schedule:

id select_type table type possible_keys key key_len ref rows filtered Extra

1 PRIMARY <derived2> ALL 1000000 100.00 Using where

2 DERIVED s ALL 1003088 100.00

2 DERIVED u eq_ref PRIMARY PRIMARY 4 20090425_timestamp.s.user_id 1 100.00

How can we deal with it?

First, we can change the expression a little. We don't need both columns to be in the formula. Instead, we can change the expression so that only timestamp_start will stay on the left side of the equation and the index on (timestamp_start, user_id) can be applied:
SELECT s.id, s.user_id,
 CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz
FROM user u
STRAIGHT_JOIN schedule s
ON s.user_id <= u.user_id
 AND s.user_id >= u.user_id
 AND timestamp_start >= CONVERT_TZ('2009-04-25 10:00:00', timezone, '+00:00')
id user_id time_usr_tz

960232 233 2009-04-25 11:00:00

502076 2077 2009-04-25 10:00:00

822783 2784 2009-04-25 12:00:00

123654 3655 2009-04-25 12:00:00

94569 4570 2009-04-25 11:00:00

688301 8302 2009-04-25 11:00:00

478347 8348 2009-04-25 11:00:00

198401 8402 2009-04-25 10:00:00

488454 8455 2009-04-25 10:00:00

9 rows fetched in 0.0003s (0.7344s)

id select_type table type possible_keys key key_len ref rows filtered Extra

1 SIMPLE u ALL PRIMARY 10225 100.00

1 SIMPLE s ALL ix_schedule_user_timestamp,ix_schedule_timestamp 998125 100.00 Range checked for each record (index map: 0x6)

This query uses users as a leading table and an index scan on schedule on for each record from users.

Note that we used s.user_id <= u.user_id AND s.user_id >= u.user_id instead of mere s.user_id = u.user_id: this is required to force MySQL to use RANGE CHECKING FOR EACH RECORD.

This query is much faster than before, but this approach required a full scan on users anyway, and 10,000 rows is still many rows, and 734 milliseconds is way too long.

Can we improve this query yet a little more?

Yes, we can.

If we look into the original query, we will see that for each record from schedule and corresponding record from user we need to perform some calculation on timestamp_start and timezone, and compare the resulting value to the constant. The fact of calculation being performed hinders the index from being used.

But what is this calculation, in fact? It just adds or subtracts some hours from timestamp_start, according to the timezone in the user's profile. This calculation cannot return anything less than -12:00 or more than +14:00 from the value stored in timestamp_start, since there are no places on Earth where time differs from the UTC more than these two extreme cases.

Since we need all timestamps that, beign converted to user's time zone, are less than a given current date, we can safely add a coarse filtering condition to our query that filters out all timestamp_start's earlier than 2009-04-25 10:00:00 -12:00, since in no case our expression can satisfy something earlier than that.

We can use this coarse filter to make the index usable, and then fine-filter the results just like before:
SELECT *
FROM (
 SELECT s.id, s.user_id,
 CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz
 FROM schedule s
 JOIN user u
 ON u.user_id = s.user_id
 WHERE s.timestamp_start >= CONVERT_TZ('2009-04-25 10:00:00', '+00:00', '-12:00')
 ) q
WHERE time_usr_tz >= '2009-04-25 10:00:00'
id user_id time_usr_tz

198401 8402 2009-04-25 10:00:00

688301 8302 2009-04-25 11:00:00

94569 4570 2009-04-25 11:00:00

123654 3655 2009-04-25 12:00:00

478347 8348 2009-04-25 11:00:00

488454 8455 2009-04-25 10:00:00

502076 2077 2009-04-25 10:00:00

822783 2784 2009-04-25 12:00:00

960232 233 2009-04-25 11:00:00

9 rows fetched in 0.0003s (0.0031s)

id select_type table type possible_keys key key_len ref rows filtered Extra

1 PRIMARY <derived2> ALL 59 100.00 Using where

2 DERIVED s range ix_schedule_user_timestamp,ix_schedule_timestamp ix_schedule_timestamp 8 59 100.00 Using where

2 DERIVED u eq_ref PRIMARY PRIMARY 4 20090425_timestamp.s.user_id 1 100.00

By adding the seemingly redundant expression (that in fact serves as an efficient coarse filter), we make the index usable and the overall query time drops from 4,562 ms to 31 ms.

More than 100 times improvement in speed.

Share this:
Facebook
X

id	user_id	time_usr_tz
94569	4570	2009-04-25 11:00:00
123654	3655	2009-04-25 12:00:00
198401	8402	2009-04-25 10:00:00
478347	8348	2009-04-25 11:00:00
488454	8455	2009-04-25 10:00:00
502076	2077	2009-04-25 10:00:00
688301	8302	2009-04-25 11:00:00
822783	2784	2009-04-25 12:00:00
960232	233	2009-04-25 11:00:00
9 rows fetched in 0.0005s (4.5625s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					1000000	100.00	Using where
2	DERIVED	s	ALL					1003088	100.00
2	DERIVED	u	eq_ref	PRIMARY	PRIMARY	4	20090425_timestamp.s.user_id	1	100.00

id	user_id	time_usr_tz
960232	233	2009-04-25 11:00:00
502076	2077	2009-04-25 10:00:00
822783	2784	2009-04-25 12:00:00
123654	3655	2009-04-25 12:00:00
94569	4570	2009-04-25 11:00:00
688301	8302	2009-04-25 11:00:00
478347	8348	2009-04-25 11:00:00
198401	8402	2009-04-25 10:00:00
488454	8455	2009-04-25 10:00:00
9 rows fetched in 0.0003s (0.7344s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	u	ALL	PRIMARY				10225	100.00
1	SIMPLE	s	ALL	ix_schedule_user_timestamp,ix_schedule_timestamp				998125	100.00	Range checked for each record (index map: 0x6)

id	user_id	time_usr_tz
198401	8402	2009-04-25 10:00:00
688301	8302	2009-04-25 11:00:00
94569	4570	2009-04-25 11:00:00
123654	3655	2009-04-25 12:00:00
478347	8348	2009-04-25 11:00:00
488454	8455	2009-04-25 10:00:00
502076	2077	2009-04-25 10:00:00
822783	2784	2009-04-25 12:00:00
960232	233	2009-04-25 11:00:00
9 rows fetched in 0.0003s (0.0031s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					59	100.00	Using where
2	DERIVED	s	range	ix_schedule_user_timestamp,ix_schedule_timestamp	ix_schedule_timestamp	8		59	100.00	Using where
2	DERIVED	u	eq_ref	PRIMARY	PRIMARY	4	20090425_timestamp.s.user_id	1	100.00

Written by Quassnoi

April 25th, 2009 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED

Selecting timestamps for a time zone

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow