EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

MySQL: grouping continuous ranges

Comments enabled. I *really* need your comment

From Stack Overflow:

Given the following table:

ID State Date
12 1 2009-07-16 10:00
45 2 2009-07-16 13:00
67 2 2009-07-16 14:40
77 1 2009-07-16 15:00
89 1 2009-07-16 15:30
99 1 2009-07-16 16:00

, how can I group by the field state, while still maintaining the borders between the state changes?

I need MIN(id), MIN(date) and COUNT(*) for each group, and this is expected:

ID State Date Count
12 1 2009-07-16 10:00 1
45 2 2009-07-16 13:00 2
77 1 2009-07-16 15:00 3

Though a pure set-based solution exists, in MySQL, this can be easily done using session variables, which is more efficient.

Let’s create a sample table:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_range (
        id INT NOT NULL PRIMARY KEY,
        state INT NOT NULL,
        date DATETIME NOT NULL,
        value VARCHAR(100) NOT NULL,
        KEY ix_range_date_state (date, state)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(10000);
COMMIT;

INSERT
INTO    t_range
SELECT  id,
        (id - 1) DIV 1000 + 1,
        '2009-07-24' - INTERVAL id MINUTE,
        RPAD(CONCAT('Value ', id), 100, ' ')
FROM    filler;

This table has 10,000 rows with 10 consecutive ranges, having states 1 and 2.

Rows from 1 to 1,000 have state = 1, rows from 1,001 to 2,000 have state = 2, next thousand rows have state = 1 again etc.

Dates come in descending order, that is higher ids have lower dates.

We can use a pure set-based approach, that is select the id of the first row, ordered by date DESC then state DESC, which differs from the state of the current row.

This value is actually the last id of the previous continuous range, and it will be the same for all rows in the current range.

Then we can just group on this value:

SELECT  MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, COUNT(*) cnt
FROM    (
        SELECT  id, date, state,
                (
                SELECT  id
                FROM    t_range ri
                WHERE   (ri.date, ri.state, ri.id) < (ro.date, ro.state, ro.id)
                        AND ri.state <> ro.state
                ORDER BY
                        date DESC, state DESC
                LIMIT 1
                ) AS prev
        FROM    t_range ro
        ) q
GROUP BY
        prev
id date state cnt
9001 2009-07-17 01:20:00 2 1000
1 2009-07-23 07:20:00 1 1000
1001 2009-07-22 14:40:00 2 1000
2001 2009-07-21 22:00:00 1 1000
3001 2009-07-21 05:20:00 2 1000
4001 2009-07-20 12:40:00 1 1000
5001 2009-07-19 20:00:00 2 1000
6001 2009-07-19 03:20:00 1 1000
7001 2009-07-18 10:40:00 2 1000
8001 2009-07-17 18:00:00 1 1000
10 rows fetched in 0.0000s (62.7648s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 10000 100.00 Using temporary; Using filesort
2 DERIVED ro index ix_range_date_state 12 9900 100.00 Using index
3 DEPENDENT SUBQUERY ri index ix_range_date_state 12 1 990000.00 Using where; Using index

This runs for 61 seconds, not efficient.

It’s another classical case of Schlemiel the painter’s Algorithm: the further it goes from the beginning of the range, the more records it takes to the beginning of the range in a subquery.

We better keep track of the groups in a session variable.

To do this, we need to:

  • Initialize two session variables: a counter and a previous state holder
  • Select all results ordered by date then state
  • Inrement the session variable whenever the state changes (previous state holder is not equal to the current state)
  • Update the state holder
  • Group by counter

And here’s the query to do this:

SELECT  MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, COUNT(*) cnt
FROM    (
        SELECT  @r := @r + (@state != state OR @state IS NULL) AS gn,
                @state := state AS sn,
                s.id, s.date, s.state
        FROM    (
                SELECT  @r := 0,
                        @state := NULL
                ) vars,
                t_range s
        ORDER BY
                date, state
        ) q
GROUP BY
        gn

id date state cnt
9001 2009-07-17 01:20:00 2 1000
8001 2009-07-17 18:00:00 1 1000
7001 2009-07-18 10:40:00 2 1000
6001 2009-07-19 03:20:00 1 1000
5001 2009-07-19 20:00:00 2 1000
4001 2009-07-20 12:40:00 1 1000
3001 2009-07-21 05:20:00 2 1000
2001 2009-07-21 22:00:00 1 1000
1001 2009-07-22 14:40:00 2 1000
1 2009-07-23 07:20:00 1 1000
10 rows fetched in 0.0005s (0.0335s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 10000 100.00 Using temporary; Using filesort
2 DERIVED <derived3> system 1 100.00
2 DERIVED s index ix_range_date_state 12 9900 100.00 Using index
3 DERIVED No tables used

This is only 30 ms, next to instant.

Written by Quassnoi

July 24th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply