Overlapping numbers at EXPLAIN EXTENDED

Overlapping numbers

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Kat asks:

I want to determine if I have any overlapping ranges of numbers in a table for specific id's.

For example, the table has the following:

ID START_NUMBER END_NUMBER

1 3035112000 3035112010

2 3035112011 3035112020

3 3035112003 3035112005

4 3032925000 3032925010

1 3035113000 3035113050

2 3035113000 3035113050

3 3035113051 3035113052

4 3032925500 3032925550

I only want to pull those that have overlapping number ranges for id's 1, 2, and 3 (but not 4).

This is in Oracle.

ID	START_NUMBER	END_NUMBER
1	3035112000	3035112010
2	3035112011	3035112020
3	3035112003	3035112005
4	3032925000	3032925010
1	3035113000	3035113050
2	3035113000	3035113050
3	3035113051	3035113052
4	3032925500	3032925550

We need to write a query to pull out all id's of rows that overlap at least one other row in the table.

Let's create a pair of sample tables:

CREATE TABLE t_number
(
id NUMBER(10) NOT NULL,
start_number NUMBER(10) NOT NULL,
end_number NUMBER(10) NOT NULL
)
/
CREATE TABLE t_numbers
(
id NUMBER(10) NOT NULL,
start_number NUMBER(10) NOT NULL,
end_number NUMBER(10) NOT NULL
)
/
CREATE INDEX ix_number2_start ON t_number2 (start_number)
/
BEGIN
DBMS_RANDOM.seed(20090702);
END;
/
INSERT
INTO    t_number
SELECT  1 AS id, 3035112000 AS start_number, 3035112010 AS end_number
FROM    dual
UNION ALL
SELECT  2, 3035112011, 3035112020
FROM    dual
UNION ALL
SELECT  3, 3035112003, 3035112005
FROM    dual
UNION ALL
SELECT  4, 3032925000, 3032925010
FROM    dual
UNION ALL
SELECT  1, 3035113000, 3035113050
FROM    dual
UNION ALL
SELECT  2, 3035113000, 3035113050
FROM    dual
UNION ALL
SELECT  3, 3035113051, 3035113052
FROM    dual
UNION ALL
SELECT  4, 3032925500, 3032925550
FROM    dual
/
INSERT
INTO    t_number2
SELECT  TRUNC(lvl / 2),
TRUNC(init * 50000000),
TRUNC(init * 50000000) + TRUNC(offset * 5)
FROM    (
SELECT  /*+ NO_MERGE */
level AS lvl,
DBMS_RANDOM.value() AS init,
DBMS_RANDOM.value() AS offset
FROM    dual
CONNECT BY
level &lt;= 10000
        )
/
COMMIT
/

The first table, called t_number, contains the sample data provided by the author of the question.

The second table, called t_number2, contains 10,000 random rows to test the queries for efficiency.

Now, how will we query?

We can go straightforward: just make a self-join, filter the rows that overlap some other row in the table and get rid of duplicate id's using DISTINCT.

To find out if two rows overlap each other, we can use a simple condition we discussed earlier:

SELECT  DISTINCT q1.id
FROM    t_number q1
JOIN    t_number q2
ON      q1.start_number &lt; q2.end_number
        AND q1.end_number &gt; q2.start_number
AND q1.rowid &lt;&gt; q2.rowid
ORDER BY
q1.id

ID
1
2
3
3 rows fetched in 0.0001s (0.0007s)

This works fine for a small table, but when we'll try to query a larger table, it will take way too much time:

SELECT  DISTINCT q1.id
FROM    t_number2 q1
JOIN    t_number2 q2
ON      q1.start_number &lt;= q2.end_number
        AND q1.end_number &gt;= q2.start_number
AND q1.rowid &lt;&gt; q2.rowid
ORDER BY
q1.id

ID
59
443
596
1286
1296
1677
2806
3068
3416
3472
3605
3828
4607
4639
14 rows fetched in 0.0003s (17.7964s)

SELECT STATEMENT 
 SORT UNIQUE
  MERGE JOIN 
   SORT JOIN
    TABLE ACCESS FULL, 20090706_overlaps.T_NUMBER2
   FILTER 
    SORT JOIN
     TABLE ACCESS FULL, 20090706_overlaps.T_NUMBER2

Almost 18 seconds, not good.

Instead of this simple condition, we can use the same approach we used to build contiguous ranges in this article:

Flattening timespans: Oracle

We need to sort all values by start_number and calculate the running maximum of end_number.

Then we will just have to find all rows that overlap their preceding row (a row does if the current value of start_number is less than the previous value of the running maximum of end_number).

Overlapping, though, is symmetrical operation (if row 1 overlaps row 2, this means that row 2 overlaps row 1 too), so we will have to return id of the previous row too as a separate record, i. e. make an unpivot. In Oracle, it's best to do this by joining the resultset with a constructed helper table.

This is more efficient but less readable and requires a little explanation, so I'll construct the query in steps.

First, we need to find the running maximum of end_number for each row. In Oracle, it's super easy to do this by using analytic functions, namely MAX:

SELECT  id, start_number, MAX(end_number) OVER (ORDER BY start_number) AS max_end_number
FROM    t_number

ID	START_NUMBER	MAX_END_NUMBER
4	3032925000	3032925010
4	3032925500	3032925550
1	3035112000	3035112010
3	3035112003	3035112010
2	3035112011	3035112020
1	3035113000	3035113050
2	3035113000	3035113050
3	3035113051	3035113052

The column max_end_number in the query above is the running maximum: it's the maximal value of end_number over all preceding values.

Now we need to find all rows whose start_number does not exceed the running maximum for the previous row.

SQL does not provide an easily way to make vertical (inter-row) calculations in a resultset, but Oracle's extensions will help us in this case. We will just need the function LAG, which returns the value of any column from the previous row.

For the sake of readability, I will use CTE's (aliased queries) instead of the nested queries:

WITH    ranges AS
(
SELECT  id, start_number, MAX(end_number) OVER (ORDER BY start_number) AS max_end_number
FROM    t_number
)
SELECT  id, start_number, max_end_number,
LAG(max_end_number) OVER (ORDER BY start_number) AS prev_max_end_number,
LAG(id) OVER (ORDER BY start_number) AS prev_id
FROM    ranges

ID	START_NUMBER	MAX_END_NUMBER	PREV_MAX_END_NUMBER	PREV_ID
4	3032925000	3032925010
4	3032925500	3032925550	3032925010	4
1	3035112000	3035112010	3032925550	4
3	3035112003	3035112010	3035112010	1
2	3035112011	3035112020	3035112010	3
1	3035113000	3035113050	3035112020	2
2	3035113000	3035113050	3035113050	1
3	3035113051	3035113052	3035113050	2

Now we need to select id's of rows that overlap their preceding row. This is done by adding a simple condition to the query above:

WITH    ranges AS
(
SELECT  id,
start_number,
MAX(end_number) OVER (ORDER BY start_number) AS max_end_number
FROM    t_number
),
lags AS
(
SELECT  id, start_number,
LAG(max_end_number) OVER (ORDER BY start_number) AS prev_max_end_number,
LAG(id) OVER (ORDER BY start_number) AS prev_id
FROM    ranges
)
SELECT  id, prev_id
FROM    lags
WHERE   start_number &lt;= prev_max_end_number

ID	PREV_ID
3	1
2	1

We see that one of the rows with id = 3 overlaps a row with id = 1 and a row with id = 2 overlaps another row with id = 1. That means that rows with id = 1 overlap their corresponding rows toom and 1 be included into the final resultset just as well.

To do this, we need to unpivot that table by joining it with an inline helper rowset and uniquify the returned id's using DISTINCT:

WITH    ranges AS
(
SELECT  id,
start_number,
MAX(end_number) OVER (ORDER BY start_number) AS max_end_number
FROM    t_number
),
lags AS
(
SELECT  id, start_number,
LAG(max_end_number) OVER (ORDER BY start_number) AS prev_max_end_number,
LAG(id) OVER (ORDER BY start_number) AS prev_id
FROM    ranges
),
overlaps AS
(
SELECT  id, prev_id
FROM    lags
WHERE   start_number &lt;= prev_max_end_number
        ),
        pivot AS (
        SELECT  1 AS which
        FROM    dual
        UNION ALL
        SELECT  2 AS which
        FROM    dual
        )
SELECT  DISTINCT DECODE(which, 1, id, 2, prev_id) AS id
FROM    overlaps
CROSS JOIN
        pivot

ID
1
2
3
3 rows fetched in 0.0002s (0.0014s)

Here's the result we're after.

Now, let's check performance on the large table:

WITH    ranges AS
(
SELECT  id,
start_number,
MAX(end_number) OVER (ORDER BY start_number) AS max_end_number
FROM    t_number2
),
lags AS
(
SELECT  id, start_number,
LAG(max_end_number) OVER (ORDER BY start_number) AS prev_max_end_number,
LAG(id) OVER (ORDER BY start_number) AS prev_id
FROM    ranges
),
overlaps AS
(
SELECT  id, prev_id
FROM    lags
WHERE   start_number &lt;= prev_max_end_number
        ),
        pivot AS (
        SELECT  1 AS which
        FROM    dual
        UNION ALL
        SELECT  2 AS which
        FROM    dual
        )
SELECT  DISTINCT DECODE(which, 1, id, 2, prev_id) AS id
FROM    overlaps
CROSS JOIN
        pivot
ORDER BY
        id

ID
59
443
596
1286
1296
1677
2806
3068
3416
3472
3605
3828
4607
4639
14 rows fetched in 0.0003s (0.0499s)

SELECT STATEMENT 
 SORT UNIQUE
  MERGE JOIN CARTESIAN
   VIEW 
    UNION-ALL 
     FAST DUAL 
     FAST DUAL 
   BUFFER SORT
    VIEW 
     WINDOW BUFFER
      VIEW 
       WINDOW SORT
        TABLE ACCESS FULL, 20090706_overlaps.T_NUMBER2

Instead of costly self-join, this query uses but a single table scan and a cheap CARTESIAN JOIN with an in-memory table to unpivot the results.

That's why for 10,000 rows this query takes only 40 ms, which is almost instant.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

July 6th, 2009 at 11:00 pm

Posted in Oracle

EXPLAIN EXTENDED