Selecting timestamps for a time zone
Comments enabled. I *really* need your comment
From Stack Overflow:
How can I reuse a computed column in an SQL query in MySQL?
My query is something like this:
SELECT CONVERT_TZ( IF(timestamp_start > last_update, timestamp_start, last_update), 'GMT', user.timezone ) AS time_usr_tz FROM schedule JOIN user ON schedule.user_id = user.user_id WHERE CONVERT_TZ( IF(timestamp_start > last_update, timestamp_start, last_update), 'GMT', user.timezone ) < CURRENT_TIMESTAMP() [/sourcecode] As you can see, the <code>CONVERT_TZ (…)</code> part is repeated. This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too. Is there any way to avoid such duplication?</blockquote> The duplication part is simple. We just need to wrap the query in a subquery or use <code>HAVING</code> clause, like this: <pre class="brush: sql"> SELECT * FROM ( SELECT CONVERT_TZ( IF(timestamp_start > last_update, timestamp_start, last_update), 'GMT', user.timezone ) AS time_usr_tz FROM schedule JOIN user ON schedule.user_id = user.user_id ) q WHERE time_usr_tz < CURRENT_TIMESTAMP() </pre> , or this: <pre class="brush: sql"> SELECT CONVERT_TZ( IF(timestamp_start > last_update, timestamp_start, last_update), 'GMT', user.timezone ) AS time_usr_tz FROM schedule JOIN user ON schedule.user_id = user.user_id HAVING time_usr_tz < CURRENT_TIMESTAMP() </pre> But what I'd really like to write about today is the performance of such a query. This query looks for the list of schedules and finds the schedules the were to be started by the time the query is issued. The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they're are due or not. That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at <strong>9:00</strong> and reminded to take a pill at <strong>13:00</strong>, but if she goes from <strong>Washington, D.C.</strong> to <strong>Honolulu</strong>, she may want all her alarm clocks to change from <strong>EST</strong> to <strong>HST</strong>. Instead of updating all the timestamps it's of course easier to update a single field in her profile. The query uses columns from both tables inside a function in the <code>WHERE</code> condition, that's why no indexes can be used to filter the timestamps. If <code>schedule</code> is large enough, this can be a problem for a busy service. Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only <code>timestamp_start</code> column in the <code>schedules</code>, which is enough to illustrate the point. <span id="more-1097"></span> <a href="#" onclick="xcollapse('X5080');return false;"><strong>Table creation details</strong></a> <br /> <div id="X5080" style="display: none; "> CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE user ( user_id INT NOT NULL PRIMARY KEY, timezone VARCHAR(10) NOT NULL, name VARCHAR(50) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE schedule ( id INT NOT NULL PRIMARY KEY, user_id INT NOT NULL, timestamp_start DATETIME NOT NULL, name VARCHAR(50) NOT NULL, KEY ix_schedule_user_timestamp (user_id, timestamp_start), KEY ix_schedule_timestamp (timestamp_start) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO user (user_id, timezone, name) SELECT id, CONCAT(IF(tz >= 0, '+', ''), tz, ':00'), CONCAT('User ', id) FROM ( SELECT id, FLOOR(RAND(20090425) * 26) - 12 AS tz FROM filler ) q LIMIT 10000; INSERT INTO schedule (id, user_id, timestamp_start, name) SELECT id, id % 10000 + 1, '2009-04-25' - INTERVAL RAND(20090425 << 1) * 50000 HOUR, CONCAT('Schedule ', id) FROM filler; [/sourcecode] </div> The tables above contain <strong>1,000,000</strong> rows in <code>schedule</code> and <strong>10,000</strong> rows in <code>user</code>. Timezones are uniformly distributed over all possible values from <strong>UTC -12:00</strong> (<a href="http://en.wikipedia.org/wiki/Baker_Island">Baker Island</a>) to <strong>UTC +14:00</strong> (<a href="http://en.wikipedia.org/wiki/Line_Islands">Line Islands</a>). The original query: SELECT * FROM ( SELECT s.id, s.user_id, CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz FROM schedule s JOIN user u ON s.user_id = u.user_id ) q WHERE time_usr_tz >= '2009-04-25 10:00:00'
id user_id time_usr_tz 94569 4570 2009-04-25 11:00:00 123654 3655 2009-04-25 12:00:00 198401 8402 2009-04-25 10:00:00 478347 8348 2009-04-25 11:00:00 488454 8455 2009-04-25 10:00:00 502076 2077 2009-04-25 10:00:00 688301 8302 2009-04-25 11:00:00 822783 2784 2009-04-25 12:00:00 960232 233 2009-04-25 11:00:00 9 rows fetched in 0.0005s (4.5625s) runs for almost 5 seconds, and uses
FULL TABLE SCAN
onschedule
:
id select_type table type possible_keys key key_len ref rows filtered Extra 1 PRIMARY <derived2> ALL 1000000 100.00 Using where 2 DERIVED s ALL 1003088 100.00 2 DERIVED u eq_ref PRIMARY PRIMARY 4 20090425_timestamp.s.user_id 1 100.00 How can we deal with it?
First, we can change the expression a little. We don't need both columns to be in the formula. Instead, we can change the expression so that only
timestamp_start
will stay on the left side of the equation and the index on(timestamp_start, user_id)
can be applied:SELECT s.id, s.user_id, CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz FROM user u STRAIGHT_JOIN schedule s ON s.user_id <= u.user_id AND s.user_id >= u.user_id AND timestamp_start >= CONVERT_TZ('2009-04-25 10:00:00', timezone, '+00:00')
id user_id time_usr_tz 960232 233 2009-04-25 11:00:00 502076 2077 2009-04-25 10:00:00 822783 2784 2009-04-25 12:00:00 123654 3655 2009-04-25 12:00:00 94569 4570 2009-04-25 11:00:00 688301 8302 2009-04-25 11:00:00 478347 8348 2009-04-25 11:00:00 198401 8402 2009-04-25 10:00:00 488454 8455 2009-04-25 10:00:00 9 rows fetched in 0.0003s (0.7344s)
id select_type table type possible_keys key key_len ref rows filtered Extra 1 SIMPLE u ALL PRIMARY 10225 100.00 1 SIMPLE s ALL ix_schedule_user_timestamp,ix_schedule_timestamp 998125 100.00 Range checked for each record (index map: 0x6) This query uses
users
as a leading table and an index scan onschedule
on for each record fromusers
.Note that we used
s.user_id <= u.user_id AND s.user_id >= u.user_id
instead of meres.user_id = u.user_id
: this is required to forceMySQL
to useRANGE CHECKING FOR EACH RECORD
.This query is much faster than before, but this approach required a full scan on
users
anyway, and 10,000 rows is still many rows, and 734 milliseconds is way too long.Can we improve this query yet a little more?
Yes, we can.
If we look into the original query, we will see that for each record from
schedule
and corresponding record fromuser
we need to perform some calculation ontimestamp_start
andtimezone
, and compare the resulting value to the constant. The fact of calculation being performed hinders the index from being used.But what is this calculation, in fact? It just adds or subtracts some hours from
timestamp_start
, according to the timezone in the user's profile. This calculation cannot return anything less than -12:00 or more than +14:00 from the value stored intimestamp_start
, since there are no places onEarth
where time differs from the UTC more than these two extreme cases.Since we need all timestamps that, beign converted to user's time zone, are less than a given current date, we can safely add a coarse filtering condition to our query that filters out all
timestamp_start
's earlier than 2009-04-25 10:00:00 -12:00, since in no case our expression can satisfy something earlier than that.We can use this coarse filter to make the index usable, and then fine-filter the results just like before:
SELECT * FROM ( SELECT s.id, s.user_id, CONVERT_TZ(timestamp_start, '+00:00', timezone) AS time_usr_tz FROM schedule s JOIN user u ON u.user_id = s.user_id WHERE s.timestamp_start >= CONVERT_TZ('2009-04-25 10:00:00', '+00:00', '-12:00') ) q WHERE time_usr_tz >= '2009-04-25 10:00:00'
id user_id time_usr_tz 198401 8402 2009-04-25 10:00:00 688301 8302 2009-04-25 11:00:00 94569 4570 2009-04-25 11:00:00 123654 3655 2009-04-25 12:00:00 478347 8348 2009-04-25 11:00:00 488454 8455 2009-04-25 10:00:00 502076 2077 2009-04-25 10:00:00 822783 2784 2009-04-25 12:00:00 960232 233 2009-04-25 11:00:00 9 rows fetched in 0.0003s (0.0031s)
id select_type table type possible_keys key key_len ref rows filtered Extra 1 PRIMARY <derived2> ALL 59 100.00 Using where 2 DERIVED s range ix_schedule_user_timestamp,ix_schedule_timestamp ix_schedule_timestamp 8 59 100.00 Using where 2 DERIVED u eq_ref PRIMARY PRIMARY 4 20090425_timestamp.s.user_id 1 100.00 By adding the seemingly redundant expression (that in fact serves as an efficient coarse filter), we make the index usable and the overall query time drops from 4,562 ms to 31 ms.
More than 100 times improvement in speed.