Inverting date ranges: MySQL
Answering questions asked on the site.
Maxym asks:
Inverting date ranges is an interesting article.
I had the same problem, and I solved it almost the same way in PostgreSQL. Your solution is great! The problem is to implement it on MySQL :) And of course performance there.
Do you have any idea how to transform this query for MySQL?
The query in question regards inverting contiguous non-overlapping date ranges: transforming them so the the gaps become the ranges and the ranges become the gaps.
The following set:
StartDate | EndDate |
---|---|
2009-11-12 | 2009-11-13 |
2009-11-15 | 2009-11-19 |
should become this:
StartDate | EndDate |
---|---|
NULL |
2009-11-12 |
2009-11-13 | 2009-11-15 |
2009-11-19 | NULL |
The link above shows how to make this query in SQL Server and can be applied to Oracle and PostgreSQL 8.4 as well.
However, the query in question relies on FULL OUTER JOIN
and ROW_NUMBER
. MySQL lacks both of these.
To implement this query in MySQL we need to use a certain workaround.
Let's create a sample table:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_range ( id INT NOT NULL PRIMARY KEY, startdate DATETIME NOT NULL, enddate DATETIME NOT NULL, data VARCHAR(20) NOT NULL, UNIQUE KEY ux_range_startdate (startdate) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(100000); COMMIT; INSERT INTO t_range SELECT id, '2009-11-12' - INTERVAL 10 * id MINUTE, '2009-11-12' - INTERVAL 10 * id - 2 MINUTE, CONCAT('Range ', id) FROM filler; [/sourcecode] </div> This table contains <strong>100,000</strong> non-overlapping ranges. The field <code>startdate</code> is indexed with a <code>UNIQUE</code> index (since the ranges should be non-overlapping, this is a good thing to do). We should make a <q>shifted</q> join here, that is join each record with the previous one. To do this in <strong>SQL Server</strong> and other systems, <code>ROW_NUMBER()</code> can be employed which makes the join condition easy (<code>rn = rn + 1</code>) and efficient even for large datasets (since it's able to use a <code>HASH JOIN</code> or a <code>MERGE JOIN</code>). <strong>MySQL</strong> does not support <code>ROW_NUMBER</code>. But even if it did, <strong>MySQL</strong> is capable of using neither <code>HASH JOIN</code> nor <code>MERGE JOIN</code> anyway, and a <code>JOIN BUFFER</code> on a non-indexed column would take forever. <h3>Subquery</h3> We can use a subquery to find the previous record and join with it: SELECT rp.enddate, rc.startdate FROM t_range rc JOIN t_range rp ON rp.startdate = ( SELECT rn.startdate FROM t_range rn WHERE rn.startdate < rc.startdate ORDER BY rn.startdate DESC LIMIT 1 ) WHERE rc.startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00' [/sourcecode] <a href="#" onclick="xcollapse('X10743');return false;"><strong>View query results</strong></a> <br /> <div id="X10743" style="display: none; "> <div class="terminal"> <table class="terminal"> <tr> <th>enddate</th><th>startdate</th></tr> <tr><td class="timestamp">2007-12-31 23:52:00</td><td class="timestamp">2008-01-01 00:00:00</td></tr> <tr><td class="timestamp">2008-01-01 00:02:00</td><td class="timestamp">2008-01-01 00:10:00</td></tr> <tr><td class="timestamp">2008-01-01 00:12:00</td><td class="timestamp">2008-01-01 00:20:00</td></tr> <tr><td class="timestamp">2008-01-01 00:22:00</td><td class="timestamp">2008-01-01 00:30:00</td></tr> <tr><td class="timestamp">2008-01-01 00:32:00</td><td class="timestamp">2008-01-01 00:40:00</td></tr> <tr><td class="timestamp">2008-01-01 00:42:00</td><td class="timestamp">2008-01-01 00:50:00</td></tr> <tr><td class="timestamp">2008-01-01 00:52:00</td><td class="timestamp">2008-01-01 01:00:00</td></tr> <tr class="statusbar"><td colspan="100">7 rows fetched in 0.0004s (1.4687s)</td></tr> </table> </div> <div class="terminal"> <table class="terminal"> <tr> <th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>filtered</th><th>Extra</th></tr> <tr><td class="bigint">1</td><td class="varchar">PRIMARY</td><td class="varchar">rc</td><td class="varchar">range</td><td class="varchar">ux_range_startdate</td><td class="varchar">ux_range_startdate</td><td class="varchar">8</td><td class="varchar"></td><td class="bigint">7</td><td class="double">100.00</td><td class="varchar">Using where; Using index</td></tr> <tr><td class="bigint">1</td><td class="varchar">PRIMARY</td><td class="varchar">rp</td><td class="varchar">eq_ref</td><td class="varchar">ux_range_startdate</td><td class="varchar">ux_range_startdate</td><td class="varchar">8</td><td class="varchar">func</td><td class="bigint">1</td><td class="double">100.00</td><td class="varchar">Using where</td></tr> <tr><td class="bigint">2</td><td class="varchar">DEPENDENT SUBQUERY</td><td class="varchar">rn</td><td class="varchar">index</td><td class="varchar">ux_range_startdate</td><td class="varchar">ux_range_startdate</td><td class="varchar">8</td><td class="varchar"></td><td class="bigint">1</td><td class="double">9360700.00</td><td class="varchar">Using where; Using index</td></tr> </table> </div> <pre> Field or reference '20091112_invert.rc.startdate' of SELECT #2 was resolved in SELECT #1 select `20091112_invert`.`rp`.`enddate` AS `enddate`,`20091112_invert`.`rc`.`startdate` AS `startdate` from `20091112_invert`.`t_range` `rc` join `20091112_invert`.`t_range` `rp` where ((`20091112_invert`.`rc`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') and (`20091112_invert`.`rp`.`startdate` = (select `20091112_invert`.`rn`.`startdate` AS `startdate` from `20091112_invert`.`t_range` `rn` where (`20091112_invert`.`rn`.`startdate` < `20091112_invert`.`rc`.`startdate`) order by `20091112_invert`.`rn`.`startdate` desc limit 1))) </pre> </div> This solution works, however, there is a little problem with it. <strong>MySQL</strong> does not optimize the <code>range</code> access methods in dependent subqueries. Each time the subquery is evaluated, full index scan is used, and this query takes almost <strong>1.5 seconds</strong> even on a tiny range of 7 records. Had the query been run on the whole table, it would take forever. <h3>Emulating LAG using session variables</h3> To work around this problem, we should emulate the window function <code>LAG</code> which is present in <strong>Oracle</strong>. To do this, we should use the approach described in the article about <a href="/2009/03/10/analytic-functions-first_value-last_value-lead-lag/"><strong>emulating LAG in MySQL</strong></a> that I wrote some time ago: SELECT @startdate AS _startdate, ( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate ) AS e, @startdate := startdate AS s FROM ( SELECT @startdate := CAST(NULL AS DATETIME) ) vars, t_range WHERE startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00' ORDER BY startdate
_startdate | e | s | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008-01-01 00:00:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:00:00 | 2008-01-01 00:02:00 | 2008-01-01 00:10:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:10:00 | 2008-01-01 00:12:00 | 2008-01-01 00:20:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:20:00 | 2008-01-01 00:22:00 | 2008-01-01 00:30:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:30:00 | 2008-01-01 00:32:00 | 2008-01-01 00:40:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:40:00 | 2008-01-01 00:42:00 | 2008-01-01 00:50:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:50:00 | 2008-01-01 00:52:00 | 2008-01-01 01:00:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 rows fetched in 0.0003s (0.0020s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived3> | system | 1 | 100.00 | |||||
1 | PRIMARY | t_range | range | ux_range_startdate | ux_range_startdate | 8 | 7 | 100.00 | Using where; Using index | |
3 | DERIVED | No tables used | ||||||||
2 | DEPENDENT SUBQUERY | rp | eq_ref | ux_range_startdate | ux_range_startdate | 8 | func | 1 | 100.00 | Using where |
Field or reference '_startdate' of SELECT #2 was resolved in SELECT #1 select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` where (`20091112_invert`.`t_range`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') order by `20091112_invert`.`t_range`.`startdate`
This query is efficiently optimized by MySQL and takes only 2 ms.
Note that in SELECT
clause I first returned the alias: @startdate AS _startdate
and then used the alias in the subquery. This is also a hack to make MySQL use a DEPENDENT SUBQUERY (which references fields) instead of UNCACHEABLE SUBQUERY (which references session variables). The former is subject to much more efficient optimization, in particular, it can use an index access (which is what we need here).
The final gap
The query above converts all ranges to the gaps but it does not return the final gap (the one between the last range and the infinity). This is what the FULL OUTER JOIN
had been used for in the SQL Server solution.
To return this final gap, we should just append it to our recordset with a UNION ALL
. The session variable will preserve its state, that's why we won't need to use any tables at all in the query being appended:
SELECT * FROM ( SELECT @startdate AS _startdate, ( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate ) AS e, @startdate := startdate AS s FROM ( SELECT @startdate := CAST(NULL AS DATETIME) ) vars, t_range WHERE startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00' ORDER BY startdate ) q UNION ALL SELECT @startdate AS _startdate, ( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate ), NULL
_startdate | e | s | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008-01-01 00:00:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:00:00 | 2008-01-01 00:02:00 | 2008-01-01 00:10:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:10:00 | 2008-01-01 00:12:00 | 2008-01-01 00:20:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:20:00 | 2008-01-01 00:22:00 | 2008-01-01 00:30:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:30:00 | 2008-01-01 00:32:00 | 2008-01-01 00:40:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:40:00 | 2008-01-01 00:42:00 | 2008-01-01 00:50:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 00:50:00 | 2008-01-01 00:52:00 | 2008-01-01 01:00:00 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2008-01-01 01:00:00 | 2008-01-01 01:02:00 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0003s (0.0117s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 7 | 100.00 | |||||
2 | DERIVED | <derived4> | system | 1 | 100.00 | |||||
2 | DERIVED | t_range | range | ux_range_startdate | ux_range_startdate | 8 | 7 | 100.00 | Using where; Using index | |
4 | DERIVED | No tables used | ||||||||
3 | DEPENDENT SUBQUERY | rp | eq_ref | ux_range_startdate | ux_range_startdate | 8 | func | 1 | 100.00 | Using where |
5 | UNION | No tables used | ||||||||
6 | DEPENDENT SUBQUERY | rp | eq_ref | ux_range_startdate | ux_range_startdate | 8 | func | 1 | 100.00 | Using where |
UNION RESULT | <union1,5> | ALL |
Field or reference '_startdate' of SELECT #3 was resolved in SELECT #2 Field or reference '_startdate' of SELECT #6 was resolved in SELECT #5 select `q`.`_startdate` AS `_startdate`,`q`.`e` AS `e`,`q`.`s` AS `s` from (select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` where (`20091112_invert`.`t_range`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') order by `20091112_invert`.`t_range`.`startdate`) `q` union all select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate )`,NULL AS `NULL`
This returns the final gap and is almost instant as well.
Performance test
Finally, let's check the performance of the query.
To do this, we will invert the whole query and calculate the total length of all finite gaps:
SELECT COUNT(*) AS cnt, SUM(TIMEDIFF(s, e)) FROM ( SELECT * FROM ( SELECT @startdate AS _startdate, ( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate ) AS e, @startdate := startdate AS s FROM ( SELECT @startdate := CAST(NULL AS DATETIME) ) vars, t_range ORDER BY startdate ) q UNION ALL SELECT @startdate AS _startdate, ( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate ), NULL ) q2
cnt | SUM(TIMEDIFF(s, e)) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100001 | 79999200.000000 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (2.8281s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 100001 | 100.00 | |||||
2 | DERIVED | <derived3> | ALL | 100000 | 100.00 | |||||
3 | DERIVED | <derived5> | system | 1 | 100.00 | |||||
3 | DERIVED | t_range | index | ux_range_startdate | 8 | 93607 | 100.00 | Using index | ||
5 | DERIVED | No tables used | ||||||||
4 | DEPENDENT SUBQUERY | rp | eq_ref | ux_range_startdate | ux_range_startdate | 8 | func | 1 | 100.00 | Using where |
6 | UNCACHEABLE UNION | No tables used | ||||||||
7 | DEPENDENT SUBQUERY | rp | eq_ref | ux_range_startdate | ux_range_startdate | 8 | func | 1 | 100.00 | Using where |
UNION RESULT | <union2,6> | ALL |
Field or reference '_startdate' of SELECT #4 was resolved in SELECT #3 Field or reference '_startdate' of SELECT #7 was resolved in SELECT #6 select count(0) AS `cnt`,sum(timediff(`q2`.`s`,`q2`.`e`)) AS `SUM(TIMEDIFF(s, e))` from (select `q`.`_startdate` AS `_startdate`,`q`.`e` AS `e`,`q`.`s` AS `s` from (select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` order by `20091112_invert`.`t_range`.`startdate`) `q` union all select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `( SELECT enddate FROM t_range rp WHERE rp.startdate = _startdate )`,NULL AS `NULL`) `q2`
This query transforms a set of 100,000 date ranges into a set 100,001 gaps between them and takes less than 3 seconds.
This is a superb performance for a query like this.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Written by Quassnoi
November 12th, 2009 at 11:00 pm
Posted in MySQL
3 Responses to 'Inverting date ranges: MySQL'
Subscribe to comments with RSS
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
Good morning,
thx for this part of code, it’s working perfectly for me.
I’d would like to have the same result, but my table is the same than your t’range table,but with an extra field :id_room .
Is it possible to get the same result than previous, ordered by id_room ?
thx a lot.
julien
2 Jul 13 at 11:27
@julien: please do not post long code in the comments. Instead, prepare your setup on http://sqlfiddle.com and post here the link.
Quassnoi
3 Jul 13 at 13:13
ok, sorry for that.
In fact, i have understood that this code can do the same than the link above ( in sqlserver).
But in the link above, there is the idressource in the table, and not in that code in Mysql.
Is it possible to do the same in Mysql with idressource field ?
julien
4 Jul 13 at 04:18