Join on overlapping date ranges at EXPLAIN EXTENDED

Join on overlapping date ranges

Comments enabled. I *really* need your comment

This post is inspired by a discussion with John Didion:

Is there any way to optimize the query for overlapping ranges in MySQL if both ranges are dynamic?

I have two tables, each with integer range columns (specified as LineString), and I want to find rows that overlap.

No matter what I try, the query planner never uses any indexes.

This question addresses a well-known problem of efficient searching for the intersecting intervals. The queries that deal with it require ability to search for the intervals (stored in two distinct columns) containing a constant scalar value.

Plain B-Tree indexes used by most databases do not speed up the queries like that. However, MySQL supports SPATIAL indexes that can index two-dimensional shapes and efficiently search for the shapes containing a given point.

With a little effort, time intervals can be converted into the geometrical objects, indexed with a SPATIAL index and searched for the given point in time (also presented as a gemetrical object). This is described in the article about overlapping ranges in MySQL.

The query described in that article, however, searches for the intervals overlapping a constant range, provided as a parameter to the query. Now, I will discuss how to adapt the query for a JOIN between two tables.

Let's create two sample tables, each containing a set of time ranges stored as geometrical objects, and find all records from both tables whose ranges overlap:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_big (
id INT NOT NULL PRIMARY KEY,
rg LineString NOT NULL,
SPATIAL KEY sx_big_rg (rg)
) ENGINE=MyISAM;

CREATE TABLE t_small (
id INT NOT NULL PRIMARY KEY,
rg LineString NOT NULL,
SPATIAL KEY sx_small_rg (rg)
) ENGINE=MyISAM;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(120000);
COMMIT;

INSERT
INTO    t_big (id, rg)
SELECT  id,
        LineString(
        Point(-1, TIMESTAMPDIFF(second, '1970-01-01', range_start)),
        Point(1, TIMESTAMPDIFF(second, '1970-01-01', range_end))
        )
FROM    (
        SELECT  id,
                STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id HOUR AS range_start,
                STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id HOUR + INTERVAL RAND(20100201) * 7200 SECOND AS range_end
        FROM    filler
        ) q;

INSERT
INTO    t_small (id, rg)
SELECT  id,
        LineString(
        Point(-1, TIMESTAMPDIFF(second, '1970-01-01', range_start)),
        Point(1, TIMESTAMPDIFF(second, '1970-01-01', range_end))
        )
FROM    (
        SELECT  id,
                STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id DAY AS range_start,
                STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id DAY + INTERVAL RAND(20100201) * 172800 SECOND AS range_end
        FROM    filler
        ORDER BY
                id        
        LIMIT 5000
        ) q

There are two tables, t_big and t_small.

t_big contains 120,000 records with intervals starting each hour and having random length from 0 to 2 hours, 1 hour in average.

t_small contains 5,000 records with intervals starting each day and having random length from 0 to 2 days, 1 day in average.

The intervals are presented as the LineString records with a single line connecting Point(-1, range_start) and Point(1, range_end). range_start and range_end are presented as the UNIX timestamps (numbers of seconds since Jan 1st, 1970). To avoid year 2038 problem, we use TIMESTAMPDIFF rather than UNIX_TIMESTAMP to calculate the value.

Each table is indexes with a SPATIAL index on the rg field.

First of all, we need to make sure that the index is created well.

MySQL uses R-Tree for the SPATIAL indexes, which works according to the principle of least enlargement to the bounding box when adding a value to the index. Each leaf node is covered by a certain minimum bounding box stored in the branch node, and the branch which needs to be increased to the least extent is chosen as a new parent for the element.

Our ranges are one-dimensional, but MySQL requires the spatial data to be two-dimensional. That's why we need to use either of the dimensions to represent the time ranges and fill another one to span some constant range so that the box areas would be proportional to the range lengths.

However, if we just fill both ends of the LineString with a constant, this will make the lines pure horizontal or pure vertical. Since they all reside on one line, the minimal bounding boxes of any set of the LineString's will have zero area (since they all will be of zero height). MySQL will not be able to choose the least enlargement: in a two-dimensional space, any horizontal enlargement of a one-dimensional line will be a zero. MySQL will just append the entry to any randomly chosen branch in no certain order. This makes the index very unbalanced and in fact unusable.

To make the index usable, we need to make the boxes have positive area so that the least enlargement principle could do its job. That's why we create the LineString fields diagonally, with the first coordinates being -1 for the start point and 1 for the end point.

Now, we should make a query that would find the intersections of these ranges.

MySQL documentation evasively mentions that the predicates supported by the spatial indexes are MBRContains and MBRWithin (with the latter being just the synonym for MBRContains with the argument order reversed), with a side note that in future releases, spatial indexes may also be used for optimizing other functions

If we were to comply with the documentation and limit ourselves to using only these two functions for the spatial indexes, the query would still be possible and quite efficient.

MBRContains checks if the bounding box of the second argument lies within the bounding box for the first argument. This is not always true for the overlapping ranges. If the ranges overlap partially, that is the first range both starts and ends earlier than the second range, then no ranges lie completely within each other's bounds, and MBRContains will return 0 for any argument order.

However, there is a simple condition that checks whether the ranges do overlap.

With any pair of ranges, there is a range that starts later (or at the same time) than another. If the ranges overlap, then the start of that range should lie within the bounds of its counterpart. Indeed, the start of the latest range should be equal to or greater then the start of the earliest range (by definition) and equal to of less than the end of the earliest range (or the ranges do not overlap).

This can be checked with a simple condition: MBRContains(rg1, StartPoint(rg2)).

Since there is no sargable condition that would find out which range starts earlier, we'll just need to check both ways.

Unfortunately, MySQL does not support merging spatial indexes, so an OR condition would make the indexes unusable. We need to use the two MBRContains predicates with two separate queries and merge the results using UNION.

When optimizing a join, MySQL's optimizer usually makes the smaller table leading in the join. With a usual equijoin condition this is a wise thing to do, since MySQL uses nested loops with an index search to join the tables. The duration of the nested loops query is the product of the leading table scan time (the number of the records) and driven table search time. B-Tree search is logarithmic, so (n × log m) is less than (m × log n) as long as m < n. That's why the leading table should be the smallest one.

However, MySQL does not take into account that MBRContains predicate is not symmetrical. Reversing the table scan order makes each search operation to be not a searching for boxes containing the point, but searching for start points contained by the box. Since the individual points are not even indexed, MySQL will not be able to use an index access path for such a query and will revert to a very inefficient join buffer (which in fact is comparing every record to every other record).

To work around this, we need to force the join order so that the table with the points leads.

In MySQL, this is done by replacing JOIN with STRAIGHT_JOIN, which forces the leftmost table to be leading in the join.

Here's the query:

SELECT  COUNT(*)
FROM    (
SELECT  s.id AS sid, b.id AS bid
FROM    t_small s
STRAIGHT_JOIN
t_big b
ON      MBRContains(b.rg, StartPoint(s.rg))
UNION
SELECT  s.id AS sid, b.id AS bid
FROM    t_big b
STRAIGHT_JOIN
t_small s
ON      MBRContains(s.rg, StartPoint(b.rg))
) q

COUNT(*)
124911
1 row fetched in 0.0001s (4.4999s)

id	select_type	table	type	possible_keys	key	key_len	rows	filtered	Extra
1	PRIMARY								Select tables optimized away
2	DERIVED	s	ALL				5000	100.00
2	DERIVED	b	range	sx_big_rg	sx_big_rg	34	1	12000000.00	Range checked for each record (index map: 0x2)
3	UNION	b	ALL				120000	100.00
3	UNION	s	range	sx_small_rg	sx_small_rg	34	1	500000.00	Range checked for each record (index map: 0x2)
	UNION RESULT	<union2,3>	ALL

select count(0) AS `COUNT(*)` from (select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` union select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_big` `b` straight_join `20100201_ranges`.`t_small` `s`) `q`

This query finds the total number of pairs of overlapping ranges from both tables and runs for 4.5 seconds which is quite efficient.

MBRIntersects: a more efficient solution

Normally I would put the words hope that helps and a link to the question form here, which I use to close the articles with the answers.

But when testing the solution, John found out that MBRIntersects function (which just checks intersection of two boxes without splitting them into start and end points) is sargable too and the query above is quite overcomplicated.

This function is symmetric, that is MBRIntersects(a, b) ≡ MBRIntersects(b, a). However, MySQL is not aware of that, and an index is used only against the column in the first argument to MBRIntersects. This means we have to make t_small to lead in the join and put its column into the second argument of the function.

This way, the column from the driven table (t_big) will be searched for the value provided from the leading table using the index on the driven table, which is exactly what we need.

Here's the new updated query:

SELECT  COUNT(*)
FROM    t_small s
STRAIGHT_JOIN
t_big b
ON      MBRIntersects(b.rg, s.rg)

COUNT(*)
124911
1 row fetched in 0.0001s (1.2656s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	s	ALL	sx_small_rg				5000	100.00
1	SIMPLE	b	ALL	sx_big_rg				120000	100.00	Range checked for each record (index map: 0x2)

select count(0) AS `COUNT(*)` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` where intersects(`20100201_ranges`.`b`.`rg`,`20100201_ranges`.`s`.`rg`)

As we can see, this query is much more simple and completes 3 times as fast.

Finally, let's see how can we retrieve the datetime values of the start and end of the ranges from the LineString's:

SELECT  s.id AS small_id,
'1970-01-01' + INTERVAL Y(StartPoint(s.rg)) SECOND AS small_start,
'1970-01-01' + INTERVAL Y(EndPoint(s.rg)) SECOND AS small_end,
b.id AS big_id,
'1970-01-01' + INTERVAL Y(StartPoint(b.rg)) SECOND AS big_start,
'1970-01-01' + INTERVAL Y(EndPoint(b.rg)) SECOND AS big_end
FROM    t_small s
STRAIGHT_JOIN
t_big b
ON      MBRIntersects(b.rg, s.rg)
LIMIT 10

small_id	small_start	small_end	big_id	big_start	big_end
1	2009-01-31 00:00:00	2009-02-01 20:58:03	1	2009-01-31 23:00:00	2009-02-01 00:52:25
1	2009-01-31 00:00:00	2009-02-01 20:58:03	2	2009-01-31 22:00:00	2009-01-31 22:01:55
1	2009-01-31 00:00:00	2009-02-01 20:58:03	3	2009-01-31 21:00:00	2009-01-31 21:32:21
1	2009-01-31 00:00:00	2009-02-01 20:58:03	4	2009-01-31 20:00:00	2009-01-31 20:36:01
1	2009-01-31 00:00:00	2009-02-01 20:58:03	5	2009-01-31 19:00:00	2009-01-31 20:23:00
1	2009-01-31 00:00:00	2009-02-01 20:58:03	6	2009-01-31 18:00:00	2009-01-31 19:06:56
1	2009-01-31 00:00:00	2009-02-01 20:58:03	7	2009-01-31 17:00:00	2009-01-31 18:25:41
1	2009-01-31 00:00:00	2009-02-01 20:58:03	8	2009-01-31 16:00:00	2009-01-31 17:47:38
1	2009-01-31 00:00:00	2009-02-01 20:58:03	9	2009-01-31 15:00:00	2009-01-31 15:41:06
1	2009-01-31 00:00:00	2009-02-01 20:58:03	10	2009-01-31 14:00:00	2009-01-31 14:02:38
10 rows fetched in 0.0006s (0.0024s)

This retrieves the range bounds in plain DATETIME format, suitable for output and calculations.

Hope that helps, John, and thanks for a nice tip!

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 1st, 2010 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED