EXPLAIN EXTENDED

How to create fast database queries

Overlapping ranges: MySQL

with one comment

Continuing yesterday's post.

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Now, let's see how do it in MySQL.

In the previous post we saw that SQL Server is not very good in optimizing this construct:

start_date < @week_end AND end_date > @week_start

and using it leads to a full table scan. A more complex condition is more efficient in SQL Server:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

, since two different indexes can be applied to both parts of the condition.

However, the second condition is intrinsically not very efficient which is a problem for almost any datetime range analysis.

A plain B-Tree index is efficient for queries like this:

constant1 < column < constant1

, but not for the queries like this:

column1 < constant < column2

The latter condition is in fact rewritten as following by the optimizer:

column1 < constant AND column2 > constant

and the optimizer is forced to choose between two non-optimal options: either stick to one index and do the filtering (using table lookups) or perform two index lookups and merge the results.

If 1,000,000 rows satisfy the first condition and 1,000,000 satisfy the second one, the engine needs to inspect either 1,000,000 rows with table lookups or 2,000,000 rows using indexes, only to find out that the intersection contains but 10 rows.

Fortunately, MySQL supports R-Tree indexes (called SPATIAL in MySQL because they are primarily used to query spatial data).

Indexes of this kind are particularly good for the queries similar to the one above, and with a little effort they can be used to query time data too.

Let's create a sample table:

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_event
(
id INT NOT NULL PRIMARY KEY,
evt_start DATE NOT NULL,
evt_end DATE NOT NULL,
evt_value VARCHAR(400) NOT NULL,
evt_range LINESTRING NOT NULL,
KEY ix_event_start (evt_start),
KEY ix_event_end (evt_end),
SPATIAL KEY sx_event_range (evt_range)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_event
SELECT  id,
        evt_start,
        evt_end,
        RPAD(CONCAT('Value ', id), 400, ' '),
        LineString(Point(-1, UNIX_TIMESTAMP(evt_start)), Point(1, UNIX_TIMESTAMP(evt_end)))
FROM    (
        SELECT  id,
                '2009-07-01' - INTERVAL id MINUTE AS evt_start,
                '2009-07-01' - INTERVAL id MINUTE + INTERVAL FLOOR(RAND(20090701) * 35) + 1 DAY AS evt_end
        FROM    filler
        ) q;

We have 1,000,000 events here that start every minute and long for a random time from 1 to 35 days.

We also have a helper column, evt_range, that contains a geometry object, LineString, that represents a line with coordinates (-1, evt_start), (+1, evt_end), evt_start and evt_end here being converted to UNIX timestamps.

The table has indexes on evt_start and evt_end, and a SPATIAL index on evt_range.

First, let's see the performance of the more simple query:

SELECT  SUM(LENGTH(evt_value)), COUNT(*)
FROM    t_event
WHERE   evt_end &gt;= '2009-06-28'
AND evt_start &lt;= '2009-06-28' + INTERVAL 7 DAY

SUM(LENGTH(evt_value)) COUNT(*)
12102000 30255
1 row fetched in 0.0001s (0.3619s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_event range ix_event_start,ix_event_end ix_event_end 8 33922 100.00 Using where

0.36 seconds, not that bad.

Unlike SQL Server, MySQL is able to see that the index will improve the efficiency here and use it.

Let's, though, change the condition a little and query for an earlier week:

SELECT  SUM(LENGTH(evt_value))
FROM    t_event
WHERE   evt_end &gt;= '2009-05-31'
AND evt_start &lt;= '2009-05-31' + INTERVAL 7 DAY

SUM(LENGTH(evt_value))
14397200
1 row fetched in 0.0001s (1.0625s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_event range ix_event_start,ix_event_end ix_event_end 8 79689 100.00 Using where

Same execution plan is used, but since there are much more values to test against, this query runs for more than 1 second.

Now, let's try to use the second condition:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

It consists of two parts OR'ed together.

This part:

(start_date > @week_start AND start_date < @week_end)

can be easily served by the index we defined on evt_start.

As for this part:

(@week_start BETWEEN start_date AND end_date)

, we need to find out if the @week_start gets into range between start_date and end_date.

We have a LineString column evt_range, defined as following:

LineString(Point(-1, UNIX_TIMESTAMP(evt_start)), Point(1, UNIX_TIMESTAMP(evt_end)))

To find out if @week_start gets inside the range, we need to find out if the minimal bounding box (MBR) of a point corresponding to @week_start gets into the minimal bonding box of the LineString.

This can be easily done using MySQL function MBRContains:

MBRContains(evt_range, Point(0, UNIX_TIMESTAMP(@week_start)))

, and this construct can use the SPATIAL index.

Now, let's combine these queries together:

SELECT  SUM(LENGTH(evt_value)), COUNT(*)
FROM    t_event
WHERE   (evt_start &gt; '2009-06-28' AND evt_start &lt;= '2009-06-28' + INTERVAL 7 DAY)
        OR (MBRContains(evt_range, Point(0, UNIX_TIMESTAMP('2009-06-28'))))

SUM(LENGTH(evt_value)) COUNT(*)
12102000 30255
1 row fetched in 0.0001s (0.3301s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_event index_merge ix_event_start,sx_event_range ix_event_start,sx_event_range 8,34 4498 100.00 Using sort_union(ix_event_start,sx_event_range); Using where

We see that this query takes two indexes together and combines them.

The index ix_event_start serves the first part of the condition, and the SPATIAL index sx_event_range serves MBRContains.

This query is just a little more efficient than the original one (still it is more efficient).

Now, let's dig a little bit deeper into the history and query the earlier events:

SELECT  SUM(LENGTH(evt_value)), COUNT(*)
FROM    t_event
WHERE   (evt_start &gt; '2009-05-31' AND evt_start &lt;= '2009-05-31' + INTERVAL 7 DAY)
        OR (MBRContains(evt_range, Point(0, UNIX_TIMESTAMP('2009-05-31'))))

SUM(LENGTH(evt_value)) COUNT(*)
14397200 35993
1 row fetched in 0.0001s (0.3854s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_event index_merge ix_event_start,sx_event_range ix_event_start,sx_event_range 8,34 10497 100.00 Using sort_union(ix_event_start,sx_event_range); Using where

Same efficiency!

The performance of the SPATIAL index solution does not depend on how long ago the range we are searching for started, and outperforms the simpler query even in the best case.

Written by Quassnoi

July 1st, 2009 at 11:00 pm

Posted in MySQL

One Response to 'Overlapping ranges: MySQL'

Subscribe to comments with RSS

  1. What is the best way to store time windows and query against them?…

    One method I’ve seen used is to map the time windows to spatial geometries. Doing so allows you to easily check for intersection in an efficient manner. Explain Extended has an example of this for MySQL you should be able to easily adapt to MS SQL: ht…

    Quora

    14 Feb 11 at 06:21

Leave a Reply