Answering questions asked on the site.
Maxym asks:
Inverting date ranges is an interesting article.
I had the same problem, and I solved it almost the same way in PostgreSQL. Your solution is great! The problem is to implement it on MySQL :) And of course performance there.
Do you have any idea how to transform this query for MySQL?
The query in question regards inverting contiguous non-overlapping date ranges: transforming them so the the gaps become the ranges and the ranges become the gaps.
The following set:
| StartDate |
EndDate |
| 2009-11-12 |
2009-11-13 |
| 2009-11-15 |
2009-11-19 |
should become this:
| StartDate |
EndDate |
NULL |
2009-11-12 |
| 2009-11-13 |
2009-11-15 |
| 2009-11-19 |
NULL |
The link above shows how to make this query in SQL Server and can be applied to Oracle and PostgreSQL 8.4 as well.
However, the query in question relies on FULL OUTER JOIN and ROW_NUMBER. MySQL lacks both of these.
To implement this query in MySQL we need to use a certain workaround.
Let’s create a sample table:
Table creation details
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE t_range (
id INT NOT NULL PRIMARY KEY,
startdate DATETIME NOT NULL,
enddate DATETIME NOT NULL,
data VARCHAR(20) NOT NULL,
UNIQUE KEY ux_range_startdate (startdate)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(100000);
COMMIT;
INSERT
INTO t_range
SELECT id,
'2009-11-12' - INTERVAL 10 * id MINUTE,
'2009-11-12' - INTERVAL 10 * id - 2 MINUTE,
CONCAT('Range ', id)
FROM filler;
This table contains 100,000 non-overlapping ranges. The field startdate is indexed with a UNIQUE index (since the ranges should be non-overlapping, this is a good thing to do).
We should make a shifted
join here, that is join each record with the previous one. To do this in SQL Server and other systems, ROW_NUMBER() can be employed which makes the join condition easy (rn = rn + 1) and efficient even for large datasets (since it’s able to use a HASH JOIN or a MERGE JOIN).
MySQL does not support ROW_NUMBER. But even if it did, MySQL is capable of using neither HASH JOIN nor MERGE JOIN anyway, and a JOIN BUFFER on a non-indexed column would take forever.
Subquery
We can use a subquery to find the previous record and join with it:
SELECT rp.enddate, rc.startdate
FROM t_range rc
JOIN t_range rp
ON rp.startdate =
(
SELECT rn.startdate
FROM t_range rn
WHERE rn.startdate < rc.startdate
ORDER BY
rn.startdate DESC
LIMIT 1
)
WHERE rc.startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
View query results
| enddate |
startdate |
| 2007-12-31 23:52:00 |
2008-01-01 00:00:00 |
| 2008-01-01 00:02:00 |
2008-01-01 00:10:00 |
| 2008-01-01 00:12:00 |
2008-01-01 00:20:00 |
| 2008-01-01 00:22:00 |
2008-01-01 00:30:00 |
| 2008-01-01 00:32:00 |
2008-01-01 00:40:00 |
| 2008-01-01 00:42:00 |
2008-01-01 00:50:00 |
| 2008-01-01 00:52:00 |
2008-01-01 01:00:00 |
| 7 rows fetched in 0.0004s (1.4687s) |
| id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
| 1 |
PRIMARY |
rc |
range |
ux_range_startdate |
ux_range_startdate |
8 |
|
7 |
100.00 |
Using where; Using index |
| 1 |
PRIMARY |
rp |
eq_ref |
ux_range_startdate |
ux_range_startdate |
8 |
func |
1 |
100.00 |
Using where |
| 2 |
DEPENDENT SUBQUERY |
rn |
index |
ux_range_startdate |
ux_range_startdate |
8 |
|
1 |
9360700.00 |
Using where; Using index |
Field or reference '20091112_invert.rc.startdate' of SELECT #2 was resolved in SELECT #1
select `20091112_invert`.`rp`.`enddate` AS `enddate`,`20091112_invert`.`rc`.`startdate` AS `startdate` from `20091112_invert`.`t_range` `rc` join `20091112_invert`.`t_range` `rp` where ((`20091112_invert`.`rc`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') and (`20091112_invert`.`rp`.`startdate` = (select `20091112_invert`.`rn`.`startdate` AS `startdate` from `20091112_invert`.`t_range` `rn` where (`20091112_invert`.`rn`.`startdate` < `20091112_invert`.`rc`.`startdate`) order by `20091112_invert`.`rn`.`startdate` desc limit 1)))
This solution works, however, there is a little problem with it. MySQL does not optimize the range access methods in dependent subqueries. Each time the subquery is evaluated, full index scan is used, and this query takes almost 1.5 seconds even on a tiny range of 7 records. Had the query been run on the whole table, it would take forever.
Emulating LAG using session variables
To work around this problem, we should emulate the window function LAG which is present in Oracle.
To do this, we should use the approach described in the article about emulating LAG in MySQL that I wrote some time ago:
SELECT @startdate AS _startdate,
(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
) AS e,
@startdate := startdate AS s
FROM (
SELECT @startdate := CAST(NULL AS DATETIME)
) vars,
t_range
WHERE startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
ORDER BY
startdate
View query results
| _startdate |
e |
s |
|
|
2008-01-01 00:00:00 |
| 2008-01-01 00:00:00 |
2008-01-01 00:02:00 |
2008-01-01 00:10:00 |
| 2008-01-01 00:10:00 |
2008-01-01 00:12:00 |
2008-01-01 00:20:00 |
| 2008-01-01 00:20:00 |
2008-01-01 00:22:00 |
2008-01-01 00:30:00 |
| 2008-01-01 00:30:00 |
2008-01-01 00:32:00 |
2008-01-01 00:40:00 |
| 2008-01-01 00:40:00 |
2008-01-01 00:42:00 |
2008-01-01 00:50:00 |
| 2008-01-01 00:50:00 |
2008-01-01 00:52:00 |
2008-01-01 01:00:00 |
| 7 rows fetched in 0.0003s (0.0020s) |
| id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
| 1 |
PRIMARY |
<derived3> |
system |
|
|
|
|
1 |
100.00 |
|
| 1 |
PRIMARY |
t_range |
range |
ux_range_startdate |
ux_range_startdate |
8 |
|
7 |
100.00 |
Using where; Using index |
| 3 |
DERIVED |
|
|
|
|
|
|
|
|
No tables used |
| 2 |
DEPENDENT SUBQUERY |
rp |
eq_ref |
ux_range_startdate |
ux_range_startdate |
8 |
func |
1 |
100.00 |
Using where |
Field or reference '_startdate' of SELECT #2 was resolved in SELECT #1
select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` where (`20091112_invert`.`t_range`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') order by `20091112_invert`.`t_range`.`startdate`
This query is efficiently optimized by MySQL and takes only 2 ms.
Note that in SELECT clause I first returned the alias: @startdate AS _startdate and then used the alias in the subquery. This is also a hack to make MySQL use a DEPENDENT SUBQUERY (which references fields) instead of UNCACHEABLE SUBQUERY (which references session variables). The former is subject to much more efficient optimization, in particular, it can use an index access (which is what we need here).
The final gap
The query above converts all ranges to the gaps but it does not return the final gap (the one between the last range and the infinity). This is what the FULL OUTER JOIN had been used for in the SQL Server solution.
To return this final gap, we should just append it to our recordset with a UNION ALL. The session variable will preserve its state, that’s why we won’t need to use any tables at all in the query being appended:
SELECT *
FROM (
SELECT @startdate AS _startdate,
(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
) AS e,
@startdate := startdate AS s
FROM (
SELECT @startdate := CAST(NULL AS DATETIME)
) vars,
t_range
WHERE startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
ORDER BY
startdate
) q
UNION ALL
SELECT @startdate AS _startdate,
(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
),
NULL
View query results
| _startdate |
e |
s |
|
|
2008-01-01 00:00:00 |
| 2008-01-01 00:00:00 |
2008-01-01 00:02:00 |
2008-01-01 00:10:00 |
| 2008-01-01 00:10:00 |
2008-01-01 00:12:00 |
2008-01-01 00:20:00 |
| 2008-01-01 00:20:00 |
2008-01-01 00:22:00 |
2008-01-01 00:30:00 |
| 2008-01-01 00:30:00 |
2008-01-01 00:32:00 |
2008-01-01 00:40:00 |
| 2008-01-01 00:40:00 |
2008-01-01 00:42:00 |
2008-01-01 00:50:00 |
| 2008-01-01 00:50:00 |
2008-01-01 00:52:00 |
2008-01-01 01:00:00 |
| 2008-01-01 01:00:00 |
2008-01-01 01:02:00 |
|
| 8 rows fetched in 0.0003s (0.0117s) |
| id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
| 1 |
PRIMARY |
<derived2> |
ALL |
|
|
|
|
7 |
100.00 |
|
| 2 |
DERIVED |
<derived4> |
system |
|
|
|
|
1 |
100.00 |
|
| 2 |
DERIVED |
t_range |
range |
ux_range_startdate |
ux_range_startdate |
8 |
|
7 |
100.00 |
Using where; Using index |
| 4 |
DERIVED |
|
|
|
|
|
|
|
|
No tables used |
| 3 |
DEPENDENT SUBQUERY |
rp |
eq_ref |
ux_range_startdate |
ux_range_startdate |
8 |
func |
1 |
100.00 |
Using where |
| 5 |
UNION |
|
|
|
|
|
|
|
|
No tables used |
| 6 |
DEPENDENT SUBQUERY |
rp |
eq_ref |
ux_range_startdate |
ux_range_startdate |
8 |
func |
1 |
100.00 |
Using where |
|
UNION RESULT |
<union1,5> |
ALL |
|
|
|
|
|
|
|
Field or reference '_startdate' of SELECT #3 was resolved in SELECT #2
Field or reference '_startdate' of SELECT #6 was resolved in SELECT #5
select `q`.`_startdate` AS `_startdate`,`q`.`e` AS `e`,`q`.`s` AS `s` from (select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` where (`20091112_invert`.`t_range`.`startdate` between '2008-01-01 00:00:00' and '2008-01-01 01:00:00') order by `20091112_invert`.`t_range`.`startdate`) `q` union all select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
)`,NULL AS `NULL`
This returns the final gap and is almost instant as well.
Performance test
Finally, let’s check the performance of the query.
To do this, we will invert the whole query and calculate the total length of all finite gaps:
SELECT COUNT(*) AS cnt, SUM(TIMEDIFF(s, e))
FROM (
SELECT *
FROM (
SELECT @startdate AS _startdate,
(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
) AS e,
@startdate := startdate AS s
FROM (
SELECT @startdate := CAST(NULL AS DATETIME)
) vars,
t_range
ORDER BY
startdate
) q
UNION ALL
SELECT @startdate AS _startdate,
(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
),
NULL
) q2
View query results
| cnt |
SUM(TIMEDIFF(s, e)) |
| 100001 |
79999200.000000 |
| 1 row fetched in 0.0001s (2.8281s) |
| id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
| 1 |
PRIMARY |
<derived2> |
ALL |
|
|
|
|
100001 |
100.00 |
|
| 2 |
DERIVED |
<derived3> |
ALL |
|
|
|
|
100000 |
100.00 |
|
| 3 |
DERIVED |
<derived5> |
system |
|
|
|
|
1 |
100.00 |
|
| 3 |
DERIVED |
t_range |
index |
|
ux_range_startdate |
8 |
|
93607 |
100.00 |
Using index |
| 5 |
DERIVED |
|
|
|
|
|
|
|
|
No tables used |
| 4 |
DEPENDENT SUBQUERY |
rp |
eq_ref |
ux_range_startdate |
ux_range_startdate |
8 |
func |
1 |
100.00 |
Using where |
| 6 |
UNCACHEABLE UNION |
|
|
|
|
|
|
|
|
No tables used |
| 7 |
DEPENDENT SUBQUERY |
rp |
eq_ref |
ux_range_startdate |
ux_range_startdate |
8 |
func |
1 |
100.00 |
Using where |
|
UNION RESULT |
<union2,6> |
ALL |
|
|
|
|
|
|
|
Field or reference '_startdate' of SELECT #4 was resolved in SELECT #3
Field or reference '_startdate' of SELECT #7 was resolved in SELECT #6
select count(0) AS `cnt`,sum(timediff(`q2`.`s`,`q2`.`e`)) AS `SUM(TIMEDIFF(s, e))` from (select `q`.`_startdate` AS `_startdate`,`q`.`e` AS `e`,`q`.`s` AS `s` from (select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `e`,(@startdate:=`20091112_invert`.`t_range`.`startdate`) AS `s` from (select (@startdate:=cast(NULL as datetime)) AS `@startdate := CAST(NULL AS DATETIME)`) `vars` join `20091112_invert`.`t_range` order by `20091112_invert`.`t_range`.`startdate`) `q` union all select (@startdate) AS `_startdate`,(select `20091112_invert`.`rp`.`enddate` AS `enddate` from `20091112_invert`.`t_range` `rp` where (`20091112_invert`.`rp`.`startdate` = `_startdate`)) AS `(
SELECT enddate
FROM t_range rp
WHERE rp.startdate = _startdate
)`,NULL AS `NULL`) `q2`
This query transforms a set of 100,000 date ranges into a set 100,001 gaps between them and takes less than 3 seconds.
This is a superb performance for a query like this.
Hope that helps.
I’m always glad to answer the questions regarding database queries.
Ask me a question