Overlapping ranges: PostgreSQL at EXPLAIN EXTENDED

Overlapping ranges: PostgreSQL

Comments enabled. I *really* need your comment

Continuing the theme on overlapping ranges:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Finally, PostgreSQL.

A quick reminder: we have two options to write a condition for such a query.

A more simple one:

start_date < @week_end AND end_date > @week_start

is, uh, more simple, but less efficient in SQL Server and MySQL.

A more complex one:

(start_date > @week_start AND start_date < @week_end) OR (@week_start BETWEEN start_date AND end_date)

is more index friendly and is more performant in SQL Server, Oracle and MySQL (with a slight schema change to enable SPATIAL index support)

Now, let's create a sample table in PostgreSQL:

CREATE TABLE t_event
(
id INT NOT NULL PRIMARY KEY,
evt_start TIMESTAMP NOT NULL,
evt_end TIMESTAMP NOT NULL,
evt_value VARCHAR(400) NOT NULL
);

CREATE INDEX ix_event_start ON t_event (evt_start);

CREATE INDEX ix_event_end_start ON t_event (evt_end, evt_start);

SELECT setseed(20090703);

INSERT
INTO    t_event
SELECT  s,
'2009-07-03'::DATE - (s || ' MINUTE')::INTERVAL,
'2009-07-03'::DATE - (s || ' MINUTE')::INTERVAL + (FLOOR(RANDOM() * 35) + 1 || ' DAY')::INTERVAL,
RPAD('Value ' || s, 400, ' ')
FROM    generate_series (1, 1000000) s;

ANALYZE t_event;

PostgreSQL also supports ANSI OVERLAPS operator, and it's even documented (unlike Oracle's one).

Let's use it:

SELECT  SUM(LENGTH(evt_value))
FROM    t_event
WHERE   (evt_start, evt_end) OVERLAPS ('2009-06-21'::TIMESTAMP, '2009-06-21'::TIMESTAMP + '7 DAY'::INTERVAL)

sum
14444400
1 row fetched in 0.0001s (1.1719s)

Aggregate  (cost=72157.44..72157.46 rows=1 width=404)
  ->  Seq Scan on t_event  (cost=0.00..71324.10 rows=333336 width=404)
        Filter: "overlaps"(evt_start, evt_end, '2009-06-21 00:00:00'::timestamp without time zone, '2009-06-28 00:00:00'::timestamp without time zone)

As in Oracle, it's not optimizable, and uses Seq Scan. Hence, it takes more than 1.17 second to complete the query.

Now, the simple condition:

SELECT  SUM(LENGTH(evt_value))
FROM    t_event
WHERE   evt_end &gt;= '2009-05-31'::TIMESTAMP
AND evt_start &lt;= '2009-05-31'::TIMESTAMP + '7 DAYS'::INTERVAL

sum
14450800
1 row fetched in 0.0001s (0.1374s)

Aggregate  (cost=65260.10..65260.11 rows=1 width=404)
  ->  Bitmap Heap Scan on t_event  (cost=2172.84..65083.55 rows=70620 width=404)
        Recheck Cond: ((evt_end >= '2009-05-31 00:00:00'::timestamp without time zone) AND (evt_start <= '2009-06-07 00:00:00'::timestamp without time zone))
        ->  Bitmap Index Scan on ix_event_end_start  (cost=0.00..2155.18 rows=70620 width=0)
              Index Cond: ((evt_end >= '2009-05-31 00:00:00'::timestamp without time zone) AND (evt_start <= '2009-06-07 00:00:00'::timestamp without time zone))

This one is much nicer: only 0.14 seconds.

This query uses the composite index on (evt_end, evt_start) to build a bitmap: a compact in-memory representation of matching values which corresponds each tid (tuple identifier) into exactly one bit in memory.

The bitmap then is scanned sequentially and whenever a 1 bit is found, the tuple is fetched.

This is more efficient than fetching tuples in a nested loop, since sequential access is faster than random.

However, the index is ranged only by evt_end.

Now let's try the complex condition:

SELECT  SUM(LENGTH(evt_value))
FROM    t_event
WHERE   evt_start &gt; '2009-05-31'::TIMESTAMP AND evt_start &lt;= '2009-05-31'::TIMESTAMP + '7 DAYS'::INTERVAL
        OR ('2009-05-31'::TIMESTAMP BETWEEN evt_start AND evt_end)

sum
14450800
1 row fetched in 0.0001s (0.1254s)

Aggregate  (cost=66192.63..66192.65 rows=1 width=404)
  ->  Bitmap Heap Scan on t_event  (cost=2458.15..65996.52 rows=78445 width=404)
        Recheck Cond: (((evt_start > '2009-05-31 00:00:00'::timestamp without time zone) AND (evt_start <= '2009-06-07 00:00:00'::timestamp without time zone)) OR (('2009-05-31 00:00:00'::timestamp without time zone <= evt_end) AND ('2009-05-31 00:00:00'::timestamp without time zone >= evt_start)))
        ->  BitmapOr  (cost=2458.15..2458.15 rows=79084 width=0)
              ->  Bitmap Index Scan on ix_event_start  (cost=0.00..263.75 rows=9131 width=0)
                    Index Cond: ((evt_start > '2009-05-31 00:00:00'::timestamp without time zone) AND (evt_start <= '2009-06-07 00:00:00'::timestamp without time zone))
              ->  Bitmap Index Scan on ix_event_end_start  (cost=0.00..2155.18 rows=69953 width=0)
                    Index Cond: (('2009-05-31 00:00:00'::timestamp without time zone <= evt_end) AND ('2009-05-31 00:00:00'::timestamp without time zone >= evt_start))

This uses the same technique with two passes: the first pass fills the bitmap with the values that correspond to the first condition (served by an index on evt_start), the second one fills the same bitmap with the values from the second condition, OR'ing the bits if necessary.

This is even more efficient, since computers nowadays are especially good in performing OR operations on preallocated memory.

Written by Quassnoi

July 3rd, 2009 at 11:00 pm

Posted in PostgreSQL

EXPLAIN EXTENDED