EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Inverting date ranges: MySQL

Answering questions asked on the site.

Maxym asks:

Inverting date ranges is an interesting article.

I had the same problem, and I solved it almost the same way in PostgreSQL. Your solution is great! The problem is to implement it on MySQL :) And of course performance there.

Do you have any idea how to transform this query for MySQL?

The query in question regards inverting contiguous non-overlapping date ranges: transforming them so the the gaps become the ranges and the ranges become the gaps.

The following set:

StartDate EndDate
2009-11-12 2009-11-13
2009-11-15 2009-11-19

should become this:

StartDate EndDate
NULL 2009-11-12
2009-11-13 2009-11-15
2009-11-19 NULL

The link above shows how to make this query in SQL Server and can be applied to Oracle and PostgreSQL 8.4 as well.

However, the query in question relies on FULL OUTER JOIN and ROW_NUMBER. MySQL lacks both of these.

To implement this query in MySQL we need to use a certain workaround.

Let’s create a sample table:

Table creation details

This table contains 100,000 non-overlapping ranges. The field startdate is indexed with a UNIQUE index (since the ranges should be non-overlapping, this is a good thing to do).

We should make a shifted join here, that is join each record with the previous one. To do this in SQL Server and other systems, ROW_NUMBER() can be employed which makes the join condition easy (rn = rn + 1) and efficient even for large datasets (since it’s able to use a HASH JOIN or a MERGE JOIN).

MySQL does not support ROW_NUMBER. But even if it did, MySQL is capable of using neither HASH JOIN nor MERGE JOIN anyway, and a JOIN BUFFER on a non-indexed column would take forever.

Subquery

We can use a subquery to find the previous record and join with it:

SELECT  rp.enddate, rc.startdate
FROM    t_range rc
JOIN    t_range rp
ON      rp.startdate =
        (
        SELECT  rn.startdate
        FROM    t_range rn
        WHERE   rn.startdate < rc.startdate
        ORDER BY
                rn.startdate DESC
        LIMIT 1
        )
WHERE   rc.startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'

View query results

This solution works, however, there is a little problem with it. MySQL does not optimize the range access methods in dependent subqueries. Each time the subquery is evaluated, full index scan is used, and this query takes almost 1.5 seconds even on a tiny range of 7 records. Had the query been run on the whole table, it would take forever.

Emulating LAG using session variables

To work around this problem, we should emulate the window function LAG which is present in Oracle.

To do this, we should use the approach described in the article about emulating LAG in MySQL that I wrote some time ago:

SELECT  @startdate AS _startdate,
        (
        SELECT  enddate
        FROM    t_range rp
        WHERE   rp.startdate = _startdate
        ) AS e,
        @startdate := startdate AS s
FROM    (
        SELECT  @startdate := CAST(NULL AS DATETIME)
        ) vars,
        t_range
WHERE   startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
ORDER BY
        startdate

View query results

This query is efficiently optimized by MySQL and takes only 2 ms.

Note that in SELECT clause I first returned the alias: @startdate AS _startdate and then used the alias in the subquery. This is also a hack to make MySQL use a DEPENDENT SUBQUERY (which references fields) instead of UNCACHEABLE SUBQUERY (which references session variables). The former is subject to much more efficient optimization, in particular, it can use an index access (which is what we need here).

The final gap

The query above converts all ranges to the gaps but it does not return the final gap (the one between the last range and the infinity). This is what the FULL OUTER JOIN had been used for in the SQL Server solution.

To return this final gap, we should just append it to our recordset with a UNION ALL. The session variable will preserve its state, that’s why we won’t need to use any tables at all in the query being appended:

SELECT  *
FROM    (
        SELECT  @startdate AS _startdate,
                (
                SELECT  enddate
                FROM    t_range rp
                WHERE   rp.startdate = _startdate
                ) AS e,
                @startdate := startdate AS s
        FROM    (
                SELECT  @startdate := CAST(NULL AS DATETIME)
                ) vars,
                t_range
        WHERE   startdate BETWEEN '2008-01-01 00:00:00' AND '2008-01-01 01:00:00'
        ORDER BY
                startdate
        ) q
UNION ALL
SELECT  @startdate AS _startdate,
        (
        SELECT  enddate
        FROM    t_range rp
        WHERE   rp.startdate = _startdate
        ),
        NULL

View query results

This returns the final gap and is almost instant as well.

Performance test

Finally, let’s check the performance of the query.

To do this, we will invert the whole query and calculate the total length of all finite gaps:

SELECT  COUNT(*) AS cnt, SUM(TIMEDIFF(s, e))
FROM    (
        SELECT  *
        FROM    (
                SELECT  @startdate AS _startdate,
                        (
                        SELECT  enddate
                        FROM    t_range rp
                        WHERE   rp.startdate = _startdate
                        ) AS e,
                        @startdate := startdate AS s
                FROM    (
                        SELECT  @startdate := CAST(NULL AS DATETIME)
                        ) vars,
                        t_range
                ORDER BY
                        startdate
                ) q
        UNION ALL
        SELECT  @startdate AS _startdate,
                (
                SELECT  enddate
                FROM    t_range rp
                WHERE   rp.startdate = _startdate
                ),
                NULL
        ) q2

View query results

This query transforms a set of 100,000 date ranges into a set 100,001 gaps between them and takes less than 3 seconds.

This is a superb performance for a query like this.

Hope that helps.


I’m always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

November 12th, 2009 at 11:00 pm

Posted in MySQL

Comments are closed.