Overlapping ranges: MySQL
Continuing yesterday's post.
From Stack Overflow:
I have an event table that specifies a date range with
start_date
andend_date
fields.I have another date range, specified in code, that defines the current week as
week_start
andweek_end
.I'd like to query all events for the week.
Now, let's see how do it in MySQL.
In the previous post we saw that SQL Server is not very good in optimizing this construct:
start_date < @week_end AND end_date > @week_start
and using it leads to a full table scan. A more complex condition is more efficient in SQL Server:
(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
, since two different indexes can be applied to both parts of the condition.
However, the second condition is intrinsically not very efficient which is a problem for almost any datetime range analysis.
A plain B-Tree index is efficient for queries like this:
constant1 < column < constant1
, but not for the queries like this:
column1 < constant < column2
The latter condition is in fact rewritten as following by the optimizer:
column1 < constant AND column2 > constant
and the optimizer is forced to choose between two non-optimal options: either stick to one index and do the filtering (using table lookups) or perform two index lookups and merge the results.
If 1,000,000 rows satisfy the first condition and 1,000,000 satisfy the second one, the engine needs to inspect either 1,000,000 rows with table lookups or 2,000,000 rows using indexes, only to find out that the intersection contains but 10 rows.
Fortunately, MySQL supports R-Tree indexes (called SPATIAL in MySQL because they are primarily used to query spatial data).
Indexes of this kind are particularly good for the queries similar to the one above, and with a little effort they can be used to query time data too.
Let's create a sample table:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_event ( id INT NOT NULL PRIMARY KEY, evt_start DATE NOT NULL, evt_end DATE NOT NULL, evt_value VARCHAR(400) NOT NULL, evt_range LINESTRING NOT NULL, KEY ix_event_start (evt_start), KEY ix_event_end (evt_end), SPATIAL KEY sx_event_range (evt_range) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_event SELECT id, evt_start, evt_end, RPAD(CONCAT('Value ', id), 400, ' '), LineString(Point(-1, UNIX_TIMESTAMP(evt_start)), Point(1, UNIX_TIMESTAMP(evt_end))) FROM ( SELECT id, '2009-07-01' - INTERVAL id MINUTE AS evt_start, '2009-07-01' - INTERVAL id MINUTE + INTERVAL FLOOR(RAND(20090701) * 35) + 1 DAY AS evt_end FROM filler ) q;
We have 1,000,000 events here that start every minute and long for a random time from 1 to 35 days.
We also have a helper column, evt_range
, that contains a geometry object, LineString
, that represents a line with coordinates (-1, evt_start), (+1, evt_end)
, evt_start
and evt_end
here being converted to UNIX timestamps.
The table has indexes on evt_start
and evt_end
, and a SPATIAL
index on evt_range
.
First, let's see the performance of the more simple query:
SELECT SUM(LENGTH(evt_value)), COUNT(*) FROM t_event WHERE evt_end >= '2009-06-28' AND evt_start <= '2009-06-28' + INTERVAL 7 DAY
SUM(LENGTH(evt_value)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12102000 | 30255 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.3619s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_event | range | ix_event_start,ix_event_end | ix_event_end | 8 | 33922 | 100.00 | Using where |
0.36 seconds, not that bad.
Unlike SQL Server, MySQL is able to see that the index will improve the efficiency here and use it.
Let's, though, change the condition a little and query for an earlier week:
SELECT SUM(LENGTH(evt_value)) FROM t_event WHERE evt_end >= '2009-05-31' AND evt_start <= '2009-05-31' + INTERVAL 7 DAY
SUM(LENGTH(evt_value)) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14397200 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (1.0625s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_event | range | ix_event_start,ix_event_end | ix_event_end | 8 | 79689 | 100.00 | Using where |
Same execution plan is used, but since there are much more values to test against, this query runs for more than 1 second.
Now, let's try to use the second condition:
(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
It consists of two parts OR
'ed together.
This part:
(start_date > @week_start AND start_date < @week_end)
can be easily served by the index we defined on evt_start
.
As for this part:
(@week_start BETWEEN start_date AND end_date)
, we need to find out if the @week_start
gets into range between start_date
and end_date
.
We have a LineString
column evt_range
, defined as following:
LineString(Point(-1, UNIX_TIMESTAMP(evt_start)), Point(1, UNIX_TIMESTAMP(evt_end)))
To find out if @week_start
gets inside the range, we need to find out if the minimal bounding box (MBR
) of a point corresponding to @week_start
gets into the minimal bonding box of the LineString
.
This can be easily done using MySQL function MBRContains
:
MBRContains(evt_range, Point(0, UNIX_TIMESTAMP(@week_start)))
, and this construct can use the SPATIAL index.
Now, let's combine these queries together:
SELECT SUM(LENGTH(evt_value)), COUNT(*) FROM t_event WHERE (evt_start > '2009-06-28' AND evt_start <= '2009-06-28' + INTERVAL 7 DAY) OR (MBRContains(evt_range, Point(0, UNIX_TIMESTAMP('2009-06-28'))))
SUM(LENGTH(evt_value)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12102000 | 30255 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.3301s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_event | index_merge | ix_event_start,sx_event_range | ix_event_start,sx_event_range | 8,34 | 4498 | 100.00 | Using sort_union(ix_event_start,sx_event_range); Using where |
We see that this query takes two indexes together and combines them.
The index ix_event_start
serves the first part of the condition, and the SPATIAL index sx_event_range
serves MBRContains
.
This query is just a little more efficient than the original one (still it is more efficient).
Now, let's dig a little bit deeper into the history and query the earlier events:
SELECT SUM(LENGTH(evt_value)), COUNT(*) FROM t_event WHERE (evt_start > '2009-05-31' AND evt_start <= '2009-05-31' + INTERVAL 7 DAY) OR (MBRContains(evt_range, Point(0, UNIX_TIMESTAMP('2009-05-31'))))
SUM(LENGTH(evt_value)) | COUNT(*) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14397200 | 35993 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.3854s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_event | index_merge | ix_event_start,sx_event_range | ix_event_start,sx_event_range | 8,34 | 10497 | 100.00 | Using sort_union(ix_event_start,sx_event_range); Using where |
Same efficiency!
The performance of the SPATIAL index solution does not depend on how long ago the range we are searching for started, and outperforms the simpler query even in the best case.
What is the best way to store time windows and query against them?…
One method I’ve seen used is to map the time windows to spatial geometries. Doing so allows you to easily check for intersection in an efficient manner. Explain Extended has an example of this for MySQL you should be able to easily adapt to MS SQL: ht…
Quora
14 Feb 11 at 06:21