EXPLAIN EXTENDED

How to create fast database queries

Archive for December 28th, 2009

Oracle: joining timestamps and time intervals

with one comment

From Stack Overflow:

I am making an inner join of two tables.

I have many time intervals represented by intervals and i want to get measure data from measures only within those intervals. The intervals do not overlap.

Here are the table layouts:

intervals
entry_time exit_time
measures
time measure

There are 1295 records in intervals and about a million records in measures. The intervals do not overlap.

The result I want to get is a table with in the first column the measure, then the time the measure has been done, the begin/end time of the considered interval (it would be repeated for row with a time within the considered range)

How can I make this faster?

Straight query

The query to get the results would look like this:

SELECT  measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time ASC

This looks quite simple. However, choosing the execution plan can be somewhat tricky task for this query.

The table layout gives us no hint on how many rows will be returned from the join. If all intervals begin before the time of the first measure and end after the time of the last measure, then every combination of rows will be returned from both tables and the resulting recordset will in fact be a cartesian product of two tables. If the intervals are short and sparse, few or even no rows can be returned from measures.

However, we know that the intervals do not overlap. This means that each measure may belong to at most one interval and the number of records in the final resultset will be no more than the number of records in measures.

Since the condition we join here is a pair of inequalities, only two methods can be used by the engine to perform the join, that is NESTED LOOPS and MERGE JOIN. HASH JOIN which is most efficient on large tables requires an equijoin and cannot be used for this query.

MERGE JOIN sorts both resultsets on the columns it joins on and gradually advances the internal pointer on both sorted row sources so that the values of the columns being joined on in both tables always match. In case of an equijoin, the engine returns only the records holding the current value of the pointer; in case of a join on inequality condition, the engine returns all records greater (or less) than the current value from the corresponding table.

MERGE JOIN, however, can only satisfy a single inequality condition while we here have two of them on two different columns. The engine should split this condition into the pair of inequalities: the first one will be used by the join; the second one will be just filtered. The query essentially turns into this one:

SELECT  measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time >= intervals.entry_time
WHERE   measures.time  <= intervals.exit_time
ORDER BY
        time ASC


Here, the MERGE JOIN will be using the predicate in the ON clause and the filtering will use the predicate in the WHERE clause. The predicates are symmetrical and can easily be swapped.

However, the join will have to return all records to the filtering code. And with an inequality condition like the one we see above, there will be lots of records. If we take a normal situation: the interval bounds are more or less in accordance with the dates of the first and last measure and the intervals are distributed evenly, then the MERGE JOIN will return (1265 × 1,000,000) / 2 ≈ 600,000,000 records, each to be filtered on the next step.

From performance's point of view, this is hardly different from a cartesian join: in fact, it needs to process only as few as a half of the rows. These rows, however, need to be sorted beforehand (or taken from the index which is slow to traverse in its sort order), so this can actually even be slower than a cartesian join which does not require the rows to be in any certain order and can use an INDEX FAST FULL SCAN or a TABLE SCAN.

This means the only efficient way to run this query is using NESTED LOOPS. To benefit from this method, we should make an index on measures.time and convince Oracle to use it.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 28th, 2009 at 11:00 pm

Posted in Oracle