Inverting date ranges: MySQL at EXPLAIN EXTENDED

Inverting date ranges: MySQL

Answering questions asked on the site.

Maxym asks:

Inverting date ranges is an interesting article.

I had the same problem, and I solved it almost the same way in PostgreSQL. Your solution is great! The problem is to implement it on MySQL :) And of course performance there.

Do you have any idea how to transform this query for MySQL?

The query in question regards inverting contiguous non-overlapping date ranges: transforming them so the the gaps become the ranges and the ranges become the gaps.

The following set:

StartDate	EndDate
2009-11-12	2009-11-13
2009-11-15	2009-11-19

should become this:

StartDate	EndDate
`NULL`	2009-11-12
2009-11-13	2009-11-15
2009-11-19	`NULL`

The link above shows how to make this query in SQL Server and can be applied to Oracle and PostgreSQL 8.4 as well.

However, the query in question relies on FULL OUTER JOIN and ROW_NUMBER. MySQL lacks both of these.

To implement this query in MySQL we need to use a certain workaround.

Let's create a sample table:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_range (
        id INT NOT NULL PRIMARY KEY,
        startdate DATETIME NOT NULL,
        enddate DATETIME NOT NULL,
        data VARCHAR(20) NOT NULL,
        UNIQUE KEY ux_range_startdate (startdate)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_range
SELECT  id,
        '2009-11-12' - INTERVAL 10 * id MINUTE,
        '2009-11-12' - INTERVAL 10 * id - 2 MINUTE,
        CONCAT('Range ', id)
FROM    filler;
&#91;/sourcecode&#93;
</div>

This table contains <strong>100,000</strong> non-overlapping ranges. The field <code>startdate</code> is indexed with a <code>UNIQUE</code> index (since the ranges should be non-overlapping, this is a good thing to do).

We should make a <q>shifted</q> join here, that is join each record with the previous one. To do this in <strong>SQL Server</strong> and other systems, <code>ROW_NUMBER()</code> can be employed which makes the join condition easy (<code>rn = rn + 1</code>) and efficient even for large datasets (since it's able to use a <code>HASH JOIN</code> or a <code>MERGE JOIN</code>).

<strong>MySQL</strong> does not support <code>ROW_NUMBER</code>. But even if it did, <strong>MySQL</strong> is capable of using neither <code>HASH JOIN</code> nor <code>MERGE JOIN</code> anyway, and a <code>JOIN BUFFER</code> on a non-indexed column would take forever.

<h3>Subquery</h3>

We can use a subquery to find the previous record and join with it:


SELECT  rp.enddate, rc.startdate
FROM    t_range rc
JOIN    t_range rp
ON      rp.startdate =
        (
        SELECT  rn.startdate
        FROM    t_range rn
        WHERE   rn.startdate < rc.startdate
        ORDER BY
                rn.startdate DESC
        LIMIT 1
        )
WHERE   rc.startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
&#91;/sourcecode&#93;

<a href="#" onclick="xcollapse('X10743');return false;"><strong>View query results</strong></a>
<br />
<div id="X10743" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>enddate</th><th>startdate</th></tr>
<tr><td class="timestamp">2007-12-31 23:52:00</td><td class="timestamp">2008-01-01 00:00:00</td></tr>
<tr><td class="timestamp">2008-01-01 00:02:00</td><td class="timestamp">2008-01-01 00:10:00</td></tr>
<tr><td class="timestamp">2008-01-01 00:12:00</td><td class="timestamp">2008-01-01 00:20:00</td></tr>
<tr><td class="timestamp">2008-01-01 00:22:00</td><td class="timestamp">2008-01-01 00:30:00</td></tr>
<tr><td class="timestamp">2008-01-01 00:32:00</td><td class="timestamp">2008-01-01 00:40:00</td></tr>
<tr><td class="timestamp">2008-01-01 00:42:00</td><td class="timestamp">2008-01-01 00:50:00</td></tr>
<tr><td class="timestamp">2008-01-01 00:52:00</td><td class="timestamp">2008-01-01 01:00:00</td></tr>
<tr class="statusbar"><td colspan="100">7 rows fetched in 0.0004s (1.4687s)</td></tr>
</table>
</div>

<div class="terminal">
<table class="terminal">
<tr>
<th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>filtered</th><th>Extra</th></tr>
<tr><td class="bigint">1</td><td class="varchar">PRIMARY</td><td class="varchar">rc</td><td class="varchar">range</td><td class="varchar">ux_range_startdate</td><td class="varchar">ux_range_startdate</td><td class="varchar">8</td><td class="varchar"></td><td class="bigint">7</td><td class="double">100.00</td><td class="varchar">Using where; Using index</td></tr>
<tr><td class="bigint">1</td><td class="varchar">PRIMARY</td><td class="varchar">rp</td><td class="varchar">eq_ref</td><td class="varchar">ux_range_startdate</td><td class="varchar">ux_range_startdate</td><td class="varchar">8</td><td class="varchar">func</td><td class="bigint">1</td><td class="double">100.00</td><td class="varchar">Using where</td></tr>
<tr><td class="bigint">2</td><td class="varchar">DEPENDENT SUBQUERY</td><td class="varchar">rn</td><td class="varchar">index</td><td class="varchar">ux_range_startdate</td><td class="varchar">ux_range_startdate</td><td class="varchar">8</td><td class="varchar"></td><td class="bigint">1</td><td class="double">9360700.00</td><td class="varchar">Using where; Using index</td></tr>
</table>
</div>

<pre>
Field or reference &#39;20091112_invert.rc.startdate&#39; of SELECT #2 was resolved in SELECT #1
select `20091112_invert`.`rp`.`enddate` AS `enddate`,`20091112_invert`.`rc`.`startdate` AS `startdate` from `20091112_invert`.`t_range` `rc` join `20091112_invert`.`t_range` `rp` where ((`20091112_invert`.`rc`.`startdate` between &#39;2008-01-01 00:00:00&#39; and &#39;2008-01-01 01:00:00&#39;) and (`20091112_invert`.`rp`.`startdate` = (select `20091112_invert`.`rn`.`startdate` AS `startdate` from `20091112_invert`.`t_range` `rn` where (`20091112_invert`.`rn`.`startdate` &lt; `20091112_invert`.`rc`.`startdate`) order by `20091112_invert`.`rn`.`startdate` desc limit 1)))
</pre>
</div>

This solution works, however, there is a little problem with it. <strong>MySQL</strong> does not optimize the <code>range</code> access methods in dependent subqueries. Each time the subquery is evaluated, full index scan is used, and this query takes almost <strong>1.5 seconds</strong> even on a tiny range of 7 records. Had the query been run on the whole table, it would take forever.

<h3>Emulating LAG using session variables</h3>

To work around this problem, we should emulate the window function <code>LAG</code> which is present in <strong>Oracle</strong>.

To do this, we should use the approach described in the article about <a href="/2009/03/10/analytic-functions-first_value-last_value-lead-lag/"><strong>emulating LAG in MySQL</strong></a> that I wrote some time ago:


SELECT  @startdate AS _startdate,
        (
        SELECT  enddate
        FROM    t_range rp
        WHERE   rp.startdate = _startdate
        ) AS e,
        @startdate := startdate AS s
FROM    (
        SELECT  @startdate := CAST(NULL AS DATETIME)
        ) vars,
        t_range
WHERE   startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
ORDER BY
        startdate

View query results

_startdate	e	s
		2008-01-01 00:00:00
2008-01-01 00:00:00	2008-01-01 00:02:00	2008-01-01 00:10:00
2008-01-01 00:10:00	2008-01-01 00:12:00	2008-01-01 00:20:00
2008-01-01 00:20:00	2008-01-01 00:22:00	2008-01-01 00:30:00
2008-01-01 00:30:00	2008-01-01 00:32:00	2008-01-01 00:40:00
2008-01-01 00:40:00	2008-01-01 00:42:00	2008-01-01 00:50:00
2008-01-01 00:50:00	2008-01-01 00:52:00	2008-01-01 01:00:00
7 rows fetched in 0.0003s (0.0020s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived3>	system					1	100.00
1	PRIMARY	t_range	range	ux_range_startdate	ux_range_startdate	8		7	100.00	Using where; Using index
3	DERIVED									No tables used
2	DEPENDENT SUBQUERY	rp	eq_ref	ux_range_startdate	ux_range_startdate	8	func	1	100.00	Using where

Field or reference '_startdate' of SELECT #2 was resolved in SELECT #1
select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` where (`20091112_invert`.`t_range`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') order by `20091112_invert`.`t_range`.`startdate`

This query is efficiently optimized by MySQL and takes only 2 ms.

Note that in SELECT clause I first returned the alias: @startdate AS _startdate and then used the alias in the subquery. This is also a hack to make MySQL use a DEPENDENT SUBQUERY (which references fields) instead of UNCACHEABLE SUBQUERY (which references session variables). The former is subject to much more efficient optimization, in particular, it can use an index access (which is what we need here).

The final gap

The query above converts all ranges to the gaps but it does not return the final gap (the one between the last range and the infinity). This is what the FULL OUTER JOIN had been used for in the SQL Server solution.

To return this final gap, we should just append it to our recordset with a UNION ALL. The session variable will preserve its state, that's why we won't need to use any tables at all in the query being appended:

SELECT  *
FROM    (
        SELECT  @startdate AS _startdate,
                (
                SELECT  enddate
                FROM    t_range rp
                WHERE   rp.startdate = _startdate
                ) AS e,
                @startdate := startdate AS s
        FROM    (
                SELECT  @startdate := CAST(NULL AS DATETIME)
                ) vars,
                t_range
        WHERE   startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
        ORDER BY
                startdate
        ) q
UNION ALL
SELECT  @startdate AS _startdate,
        (
        SELECT  enddate
        FROM    t_range rp
        WHERE   rp.startdate = _startdate
        ),
        NULL

View query results

_startdate	e	s
		2008-01-01 00:00:00
2008-01-01 00:00:00	2008-01-01 00:02:00	2008-01-01 00:10:00
2008-01-01 00:10:00	2008-01-01 00:12:00	2008-01-01 00:20:00
2008-01-01 00:20:00	2008-01-01 00:22:00	2008-01-01 00:30:00
2008-01-01 00:30:00	2008-01-01 00:32:00	2008-01-01 00:40:00
2008-01-01 00:40:00	2008-01-01 00:42:00	2008-01-01 00:50:00
2008-01-01 00:50:00	2008-01-01 00:52:00	2008-01-01 01:00:00
2008-01-01 01:00:00	2008-01-01 01:02:00
8 rows fetched in 0.0003s (0.0117s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					7	100.00
2	DERIVED	<derived4>	system					1	100.00
2	DERIVED	t_range	range	ux_range_startdate	ux_range_startdate	8		7	100.00	Using where; Using index
4	DERIVED									No tables used
3	DEPENDENT SUBQUERY	rp	eq_ref	ux_range_startdate	ux_range_startdate	8	func	1	100.00	Using where
5	UNION									No tables used
6	DEPENDENT SUBQUERY	rp	eq_ref	ux_range_startdate	ux_range_startdate	8	func	1	100.00	Using where
	UNION RESULT	<union1,5>	ALL

Field or reference '_startdate' of SELECT #3 was resolved in SELECT #2
Field or reference '_startdate' of SELECT #6 was resolved in SELECT #5
select `q`.`_startdate` AS `_startdate`,`q`.`e` AS `e`,`q`.`s` AS `s` from (select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` where (`20091112_invert`.`t_range`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') order by `20091112_invert`.`t_range`.`startdate`) `q` union all select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `(
        SELECT  enddate
        FROM    t_range rp
        WHERE   rp.startdate = _startdate
        )`,NULL AS `NULL`

This returns the final gap and is almost instant as well.

Performance test

Finally, let's check the performance of the query.

To do this, we will invert the whole query and calculate the total length of all finite gaps:

SELECT  COUNT(*) AS cnt, SUM(TIMEDIFF(s, e))
FROM    (
        SELECT  *
        FROM    (
                SELECT  @startdate AS _startdate,
                        (
                        SELECT  enddate
                        FROM    t_range rp
                        WHERE   rp.startdate = _startdate
                        ) AS e,
                        @startdate := startdate AS s
                FROM    (
                        SELECT  @startdate := CAST(NULL AS DATETIME)
                        ) vars,
                        t_range
                ORDER BY
                        startdate
                ) q
        UNION ALL
        SELECT  @startdate AS _startdate,
                (
                SELECT  enddate
                FROM    t_range rp
                WHERE   rp.startdate = _startdate
                ),
                NULL        
        ) q2

View query results

cnt	SUM(TIMEDIFF(s, e))
100001	79999200.000000
1 row fetched in 0.0001s (2.8281s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					100001	100.00
2	DERIVED	<derived3>	ALL					100000	100.00
3	DERIVED	<derived5>	system					1	100.00
3	DERIVED	t_range	index		ux_range_startdate	8		93607	100.00	Using index
5	DERIVED									No tables used
4	DEPENDENT SUBQUERY	rp	eq_ref	ux_range_startdate	ux_range_startdate	8	func	1	100.00	Using where
6	UNCACHEABLE UNION									No tables used
7	DEPENDENT SUBQUERY	rp	eq_ref	ux_range_startdate	ux_range_startdate	8	func	1	100.00	Using where
	UNION RESULT	<union2,6>	ALL

Field or reference '_startdate' of SELECT #4 was resolved in SELECT #3
Field or reference '_startdate' of SELECT #7 was resolved in SELECT #6
select count(0) AS `cnt`,sum(timediff(`q2`.`s`,`q2`.`e`)) AS `SUM(TIMEDIFF(s, e))` from (select `q`.`_startdate` AS `_startdate`,`q`.`e` AS `e`,`q`.`s` AS `s` from (select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` order by `20091112_invert`.`t_range`.`startdate`) `q` union all select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `(
                SELECT  enddate
                FROM    t_range rp
                WHERE   rp.startdate = _startdate
                )`,NULL AS `NULL`) `q2`

This query transforms a set of 100,000 date ranges into a set 100,001 gaps between them and takes less than 3 seconds.

This is a superb performance for a query like this.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

November 12th, 2009 at 11:00 pm

Posted in MySQL

3 Responses to 'Inverting date ranges: MySQL'

Subscribe to comments with RSS

Good morning,
thx for this part of code, it’s working perfectly for me.
I’d would like to have the same result, but my table is the same than your t’range table,but with an extra field :id_room .
Is it possible to get the same result than previous, ordered by id_room ?
thx a lot.

julien

2 Jul 13 at 11:27
@julien: please do not post long code in the comments. Instead, prepare your setup on http://sqlfiddle.com and post here the link.

Quassnoi

3 Jul 13 at 13:13
ok, sorry for that.

In fact, i have understood that this code can do the same than the link above ( in sqlserver).
But in the link above, there is the idressource in the table, and not in that code in Mysql.

Is it possible to do the same in Mysql with idressource field ?

julien

4 Jul 13 at 04:18

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

EXPLAIN EXTENDED