EXPLAIN EXTENDED

How to create fast database queries

Flattening timespans: MySQL

Comments enabled. I *really* need your comment

In the previous articles:

, I described how to merge intersecting timespans into several contiguous ranges.

In SQL Server, this task required a table-valued function (TVF) and a cursor.

In Oracle, we had to employ window functions to do this.

Now let's see how to solve this task in MySQL.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

MySQL offers easy use of session variables. These are loosely typed variables defined within the session scope which you can assign and evaluate right inside the query.

Using these variables may come in handy for tasks like this.

In MySQL, just like in previous solutions, we will need to order the timespans by start anc calculate the running maximum of the stop.

This is easily done using session variable @edate which we will use to store running maximum. We will update in in the SELECT clause.

We will also declare another variable @r, initialize it with a zero and use it to keep a range number.

If we have a timespan whose start exceeds the previous value of the running maximum then all timespans that had started eariler had also been stopped by this time, and we have a gap.

The current timespan, therefore, belongs to another range, and we need to increment the value of @r.

Finally, we will just group the timespans on the range number and return MIN(start) and MAX(stop) for all timespans in a range. Easy to see that these will be the start and the stop for the whole range.

Here is the sample table:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_span (
        s_start DATETIME NOT NULL,
        s_stop DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(720000);
COMMIT;

INSERT
INTO    t_span (s_start, s_stop)
SELECT  CAST('2009-06-13' AS DATE) - INTERVAL id * 30 SECOND,
        CAST('2009-06-13' AS DATE) - INTERVAL id * 30 SECOND + INTERVAL 15 + RAND(20090613) * 300 SECOND
FROM    (
        SELECT  id
        FROM    filler
        UNION ALL
        SELECT  id + 720000
        FROM    filler
        ) q;
&#91;/sourcecode&#93;

Again, we have <strong>1,440,000</strong> timespans, starting each <strong>30</strong> seconds and having random length from <strong>15</strong> to <strong>315</strong> seconds.

And here's the query:


SELECT  MIN(s_start) AS r_start, MAX(s_stop) AS r_stop
FROM    (
        SELECT  s_start,
                s_stop,
                @r := @r + (s_start > @edate) AS r_num,
                @edate := GREATEST(@edate, s_stop)
        FROM    (
                SELECT  @r := 0,
                        @edate := '0001-01-01'
                ) vars,
                t_span
        ORDER BY
                s_start
        ) q
GROUP BY
        r_num

Note the use of the following construct: @r := @r + (s_start > @edate) AS r_num.

MySQL can cast boolean's to integer's implicitly.

This means that we increment @r either by 0 (if the current s_start is not greater than the previous running maximum of s_stop), or by 1 (if the current s_start is greater than the running maximum).

In other words, we increment @r whenever we find a gap so that the first timespan after the gap is assigned with the new value of @r.

Here's the result of the query:

r_start r_stop
2008-01-30 00:00:00 2008-02-07 19:16:56
2008-02-07 19:17:00 2008-02-11 10:52:18
2008-02-11 10:52:30 2008-02-22 12:12:57
2008-02-22 12:13:00 2008-02-22 12:13:25
2008-02-22 12:13:30 2008-02-22 12:13:58
2008-02-22 12:14:00 2008-02-22 17:25:48
2008-02-22 17:26:00 2008-03-01 01:23:18
2008-03-01 01:23:30 2008-03-02 04:22:23
2008-03-02 04:22:30 2008-03-02 04:22:56
2008-03-02 04:23:00 2008-03-20 01:29:59
2009-06-09 10:31:00 2009-06-11 10:09:46
2009-06-11 10:10:00 2009-06-12 02:41:17
2009-06-12 02:41:30 2009-06-13 00:04:13
137 rows fetched in 0.0031s (11.3437s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 1440000 100.00 Using temporary; Using filesort
2 DERIVED <derived3> system 1 100.00 Using filesort
2 DERIVED t_span ALL 1440277 100.00
3 DERIVED No tables used

It's much more simple and elegant than same solutions for SQL Server and Oracle.

Besides, it just runs faster.

Written by Quassnoi

June 13th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply