This post is inspired by a discussion with John Didion:
Is there any way to optimize the query for overlapping ranges in MySQL if both ranges are dynamic?
I have two tables, each with integer range columns (specified as
LineString), and I want to find rows that overlap.
No matter what I try, the query planner never uses any indexes.
This question addresses a well-known problem of efficient searching for the intersecting intervals. The queries that deal with it require ability to search for the intervals (stored in two distinct columns) containing a constant scalar value.
Plain B-Tree indexes used by most databases do not speed up the queries like that. However, MySQL supports
SPATIAL indexes that can index two-dimensional shapes and efficiently search for the shapes containing a given point.
With a little effort, time intervals can be converted into the geometrical objects, indexed with a
SPATIAL index and searched for the given point in time (also presented as a gemetrical object). This is described in the article about overlapping ranges in MySQL.
The query described in that article, however, searches for the intervals overlapping a constant range, provided as a parameter to the query. Now, I will discuss how to adapt the query for a
JOIN between two tables.
Let's create two sample tables, each containing a set of time ranges stored as geometrical objects, and find all records from both tables whose ranges overlap:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_big ( id INT NOT NULL PRIMARY KEY, rg LineString NOT NULL, SPATIAL KEY sx_big_rg (rg) ) ENGINE=MyISAM; CREATE TABLE t_small ( id INT NOT NULL PRIMARY KEY, rg LineString NOT NULL, SPATIAL KEY sx_small_rg (rg) ) ENGINE=MyISAM; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(120000); COMMIT; INSERT INTO t_big (id, rg) SELECT id, LineString( Point(-1, TIMESTAMPDIFF(second, '1970-01-01', range_start)), Point(1, TIMESTAMPDIFF(second, '1970-01-01', range_end)) ) FROM ( SELECT id, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id HOUR AS range_start, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id HOUR + INTERVAL RAND(20100201) * 7200 SECOND AS range_end FROM filler ) q; INSERT INTO t_small (id, rg) SELECT id, LineString( Point(-1, TIMESTAMPDIFF(second, '1970-01-01', range_start)), Point(1, TIMESTAMPDIFF(second, '1970-01-01', range_end)) ) FROM ( SELECT id, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id DAY AS range_start, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id DAY + INTERVAL RAND(20100201) * 172800 SECOND AS range_end FROM filler ORDER BY id LIMIT 5000 ) q
There are two tables,
t_big contains 120,000 records with intervals starting each hour and having random length from
2 hours, 1 hour in average.
t_small contains 5,000 records with intervals starting each day and having random length from
2 days, 1 day in average.
The intervals are presented as the
LineString records with a single line connecting
Point(-1, range_start) and
range_end are presented as the UNIX timestamps (numbers of seconds since Jan 1st, 1970). To avoid year 2038 problem, we use
TIMESTAMPDIFF rather than
UNIX_TIMESTAMP to calculate the value.
Each table is indexes with a
SPATIAL index on the
First of all, we need to make sure that the index is created well.
MySQL uses R-Tree for the
SPATIAL indexes, which works according to the principle of
least enlargement to the bounding box when adding a value to the index. Each leaf node is covered by a certain minimum bounding box stored in the branch node, and the branch which needs to be increased to the least extent is chosen as a new parent for the element.
Our ranges are one-dimensional, but MySQL requires the spatial data to be two-dimensional. That's why we need to use either of the dimensions to represent the time ranges and fill another one to span some constant range so that the box areas would be proportional to the range lengths.
However, if we just fill both ends of the
LineString with a constant, this will make the lines pure horizontal or pure vertical. Since they all reside on one line, the minimal bounding boxes of any set of the
LineString's will have zero area (since they all will be of zero height). MySQL will not be able to choose the least enlargement: in a two-dimensional space, any horizontal enlargement of a one-dimensional line will be a zero. MySQL will just append the entry to any randomly chosen branch in no certain order. This makes the index very unbalanced and in fact unusable.
To make the index usable, we need to make the boxes have positive area so that the least enlargement principle could do its job. That's why we create the
LineString fields diagonally, with the first coordinates being -1 for the start point and 1 for the end point.
Now, we should make a query that would find the intersections of these ranges.
MySQL documentation evasively mentions that the predicates supported by the spatial indexes are
MBRWithin (with the latter being just the synonym for
MBRContains with the argument order reversed), with a side note that
in future releases, spatial indexes may also be used for optimizing other functions
If we were to comply with the documentation and limit ourselves to using only these two functions for the spatial indexes, the query would still be possible and quite efficient.
MBRContains checks if the bounding box of the second argument lies within the bounding box for the first argument. This is not always true for the overlapping ranges. If the ranges overlap partially, that is the first range both starts and ends earlier than the second range, then no ranges lie completely within each other's bounds, and
MBRContains will return 0 for any argument order.
However, there is a simple condition that checks whether the ranges do overlap.
With any pair of ranges, there is a range that starts later (or at the same time) than another. If the ranges overlap, then the start of that range should lie within the bounds of its counterpart. Indeed, the start of the latest range should be equal to or greater then the start of the earliest range (by definition) and equal to of less than the end of the earliest range (or the ranges do not overlap).
This can be checked with a simple condition:
Since there is no sargable condition that would find out which range starts earlier, we'll just need to check both ways.
Unfortunately, MySQL does not support merging spatial indexes, so an
OR condition would make the indexes unusable. We need to use the two
MBRContains predicates with two separate queries and merge the results using
When optimizing a join, MySQL's optimizer usually makes the smaller table leading in the join. With a usual equijoin condition this is a wise thing to do, since MySQL uses nested loops with an index search to join the tables. The duration of the nested loops query is the product of the leading table scan time (the number of the records) and driven table search time. B-Tree search is logarithmic, so
(n × log m) is less than
(m × log n) as long as
m < n. That's why the leading table should be the smallest one.
However, MySQL does not take into account that
MBRContains predicate is not symmetrical. Reversing the table scan order makes each search operation to be not a searching for boxes containing the point, but searching for start points contained by the box. Since the individual points are not even indexed, MySQL will not be able to use an index access path for such a query and will revert to a very inefficient
join buffer (which in fact is comparing every record to every other record).
To work around this, we need to force the join order so that the table with the points leads.
In MySQL, this is done by replacing
STRAIGHT_JOIN, which forces the leftmost table to be leading in the join.
Here's the query:
SELECT COUNT(*) FROM ( SELECT s.id AS sid, b.id AS bid FROM t_small s STRAIGHT_JOIN t_big b ON MBRContains(b.rg, StartPoint(s.rg)) UNION SELECT s.id AS sid, b.id AS bid FROM t_big b STRAIGHT_JOIN t_small s ON MBRContains(s.rg, StartPoint(b.rg)) ) q
|1 row fetched in 0.0001s (4.4999s)|
|1||PRIMARY||Select tables optimized away|
|2||DERIVED||b||range||sx_big_rg||sx_big_rg||34||1||12000000.00||Range checked for each record (index map: 0x2)|
|3||UNION||s||range||sx_small_rg||sx_small_rg||34||1||500000.00||Range checked for each record (index map: 0x2)|
select count(0) AS `COUNT(*)` from (select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` union select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_big` `b` straight_join `20100201_ranges`.`t_small` `s`) `q`
This query finds the total number of pairs of overlapping ranges from both tables and runs for 4.5 seconds which is quite efficient.
MBRIntersects: a more efficient solution
Normally I would put the words
hope that helps and a link to the question form here, which I use to close the articles with the answers.
But when testing the solution, John found out that
MBRIntersects function (which just checks intersection of two boxes without splitting them into start and end points) is sargable too and the query above is quite overcomplicated.
This function is symmetric, that is
MBRIntersects(a, b) ≡ MBRIntersects(b, a). However, MySQL is not aware of that, and an index is used only against the column in the first argument to
MBRIntersects. This means we have to make
t_small to lead in the join and put its column into the second argument of the function.
This way, the column from the driven table (
t_big) will be searched for the value provided from the leading table using the index on the driven table, which is exactly what we need.
Here's the new updated query:
SELECT COUNT(*) FROM t_small s STRAIGHT_JOIN t_big b ON MBRIntersects(b.rg, s.rg)
|1 row fetched in 0.0001s (1.2656s)|
|1||SIMPLE||b||ALL||sx_big_rg||120000||100.00||Range checked for each record (index map: 0x2)|
select count(0) AS `COUNT(*)` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` where intersects(`20100201_ranges`.`b`.`rg`,`20100201_ranges`.`s`.`rg`)
As we can see, this query is much more simple and completes 3 times as fast.
Finally, let's see how can we retrieve the datetime values of the start and end of the ranges from the
SELECT s.id AS small_id, '1970-01-01' + INTERVAL Y(StartPoint(s.rg)) SECOND AS small_start, '1970-01-01' + INTERVAL Y(EndPoint(s.rg)) SECOND AS small_end, b.id AS big_id, '1970-01-01' + INTERVAL Y(StartPoint(b.rg)) SECOND AS big_start, '1970-01-01' + INTERVAL Y(EndPoint(b.rg)) SECOND AS big_end FROM t_small s STRAIGHT_JOIN t_big b ON MBRIntersects(b.rg, s.rg) LIMIT 10
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||1||2009-01-31 23:00:00||2009-02-01 00:52:25|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||2||2009-01-31 22:00:00||2009-01-31 22:01:55|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||3||2009-01-31 21:00:00||2009-01-31 21:32:21|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||4||2009-01-31 20:00:00||2009-01-31 20:36:01|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||5||2009-01-31 19:00:00||2009-01-31 20:23:00|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||6||2009-01-31 18:00:00||2009-01-31 19:06:56|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||7||2009-01-31 17:00:00||2009-01-31 18:25:41|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||8||2009-01-31 16:00:00||2009-01-31 17:47:38|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||9||2009-01-31 15:00:00||2009-01-31 15:41:06|
|1||2009-01-31 00:00:00||2009-02-01 20:58:03||10||2009-01-31 14:00:00||2009-01-31 14:02:38|
|10 rows fetched in 0.0006s (0.0024s)|
This retrieves the range bounds in plain
DATETIME format, suitable for output and calculations.
Hope that helps, John, and thanks for a nice tip!
I'm always glad to answer the questions regarding database queries.