Searching for arbitrary portions of a date at EXPLAIN EXTENDED

I have a Ruby on Rails application with a PostgreSQL database; several tables have created_at and updated_at timestamp attributes.

When displayed, those dates are formatted in the user's locale; for example, the timestamp 2009-10-15 16:30:00.435 becomes the string 15.10.2009 - 16:30 (the date format for this example being dd.mm.yyyy - hh.mm).

The requirement is that the user must be able to search for records by date, as if they were strings formatted in the current locale.

For example, searching for 15.10.2009 would return records with dates on October 15th 2009; searching for 15.10 would return records with dates on October 15th of any year, searching for 15 would return all dates that match 15 (be it day, month or year).

The simplest solution would be just retrieve the locale string from the client, format the dates according to that string and search them using LIKE or ~ operators (the latter, as we all know, searches for POSIX regular expressions).

However, this would be not very efficient.

Let's create a sample table and see:

Table creation details

CREATE TABLE t_dates (
        id INT NOT NULL PRIMARY KEY,
        date TIMESTAMP NOT NULL,
        name VARCHAR(20) NOT NULL,
        stuffing VARCHAR(200) NOT NULL
);

CREATE INDEX ix_dates_date ON t_dates (date);

CREATE INDEX ix_dates_parts ON t_dates
USING GIN((
        ARRAY[
        DATE_PART('year', date)::INTEGER,
        DATE_PART('year', date)::INTEGER % 100,
        DATE_PART('month', date)::INTEGER,
        DATE_PART('day', date)::INTEGER,
        DATE_PART('hour', date)::INTEGER,
        (DATE_PART('hour', date)::INTEGER + 1) % 12 + 1,
        DATE_PART('minute', date)::INTEGER,
        DATE_PART('second', date)::INTEGER
        ]
        ));

SELECT  SETSEED(0.20100202);

INSERT
INTO    t_dates (id, date, name, stuffing)
SELECT  id,
        TO_TIMESTAMP('2010-02-02', 'YYYY-MM-DD') -
        (id || ' hour')::INTERVAL +
        (FLOOR(RANDOM() * 1800) || ' second')::INTERVAL,
        'Date ' || id, 
        RPAD('', 200, '*')
FROM    generate_series(1, 1000000) id;

This query contains 1,000,000 records with random timestamps spanning more than 114 years.

Assuming that the client's date format is set to dd.mm.yy hh24.mi.ss, let's try to select the number of records that satisfy this string: '20.12'. We are also assuming that the beginning of string and the end of string are the field separators as well:

SELECT  COUNT(*)
FROM    t_dates
WHERE   TO_CHAR(date, 'dd.mm.yy hh24.mi.ss') ~ E'(^|[^\\d])20\\.12([^\\d]|$)'

count
4235
1 row fetched in 0.0001s (7.9138s)

Aggregate  (cost=47262.58..47262.59 rows=1 width=0)
  ->  Seq Scan on t_dates  (cost=0.00..47258.97 rows=1444 width=0)
        Filter: (to_char(date, 'dd.mm.yy hh24.mi.ss'::text) ~ '(^|[^\\d])20\\.12([^\\d]|$)'::text)

This query runs for almost 8 seconds. Let's see which values does it return:

SELECT  id, date
FROM    t_dates
WHERE   TO_CHAR(date, 'dd.mm.yy hh24.mi.ss') ~ E'(^|[^\\d])20\\.12([^\\d]|$)'
ORDER BY
        MD5(id::TEXT)
LIMIT 10

id	date
622924	1939-01-10 20:12:31
781217	1920-12-20 07:11:26
501772	1952-11-05 20:12:03
956539	1900-12-20 04:57:58
523679	1950-05-08 01:20:12
141308	1993-12-20 04:07:37
648220	1936-02-21 20:12:30
236980	1983-01-20 20:12:29
413051	1962-12-20 13:20:54
323566	1973-03-06 02:20:12

We see the matches on day-month, hour-minute and minute-second. There are no month-year or year-hour matches, since there is no 20th month and the year-hour separator is not a period.

The query seems to return correct values but is quite slow.

To improve this query we can use PostgreSQL's GIN indexing abilities.

A GIN index is a way to index one record with several keys.

A plain index is a B-Tree structure that stores the pointers to the records (the ctid's) in the leaf nodes. Such an index only accepts a single expression as a key and builds a single sort order over these expressions, so each record can be pointed to at most once. There is a one-to-many mapping between keys and records: a key can point to many records, but a record can be pointed to by at most one key.

A GIN index, on the other hand, accepts an array of expressions as a parameter and uses each element of the array as a key. This way, the mapping becomes many-to-many: each key can point to many records, and each record can be pointed to by many keys.

Usually, GIN indexes are used for FULLTEXT indexing: the piece of text stored in a record is split into the separate words and each word is indexed separately so that search for any word can be performed using the index.

However, in PostgreSQL, GIN indexes support integer arrays as well. And we can use this support to improve our query.

As many of you may have noted, I created a GIN index in the table creation script. Here's how it looks:

CREATE INDEX ix_dates_parts ON t_dates
USING GIN((
        ARRAY[
        DATE_PART('year', date)::INTEGER,
        DATE_PART('year', date)::INTEGER % 100,
        DATE_PART('month', date)::INTEGER,
        DATE_PART('day', date)::INTEGER,
        DATE_PART('hour', date)::INTEGER,
        (DATE_PART('hour', date)::INTEGER + 1) % 12 + 1,
        DATE_PART('minute', date)::INTEGER,
        DATE_PART('second', date)::INTEGER
        ]
        ));

Each record is split into an array of 8 integers, each representing a certain portion of a date which are normally used in the date formatting options. 6 of them just represent date parts, and there are two extra integers that represent a 2-digit year and an AM/PM hour.

This way, any record in year 2010 gets indexed with both 2010 and 10, and any record with hour 19 gets indexed with both 19 and 7.

This covers most formatting options, and can be changed to include less used ones.

To make use of this index we should provide an additional predicate in the WHERE clause. This predicate will take a user-provided array as an input and search for the records that contain all elements of the user-provided array in the date parts.

Here's how our query looks now:

SELECT COUNT(*)
FROM t_dates
WHERE TO_CHAR(date, 'dd.mm.yy hh24.mi.ss') ~ E'(^|[^\\d])20\\.12([^\\d]|$)'
AND ARRAY[20, 12] <@ ARRAY[ DATE_PART('year', date)::INTEGER, DATE_PART('year', date)::INTEGER % 100, DATE_PART('month', date)::INTEGER, DATE_PART('day', date)::INTEGER, DATE_PART('hour', date)::INTEGER, (DATE_PART('hour', date)::INTEGER + 1) % 12 + 1, DATE_PART('minute', date)::INTEGER, DATE_PART('second', date)::INTEGER ] [/sourcecode]

count
4235
1 row fetched in 0.0001s (0.2366s)

Aggregate  (cost=3595.90..3595.91 rows=1 width=0)
  ->  Bitmap Heap Scan on t_dates  (cost=104.76..3595.90 rows=1 width=0)
        Recheck Cond: ('{20,12}'::integer[] <@ ARRAY[(date_part('year'::text, date))::integer, ((date_part('year'::text, date))::integer % 100), (date_part('month'::text, date))::integer, (date_part('day'::text, date))::integer, (date_part('hour'::text, date))::integer, ((((date_part('hour'::text, date))::integer + 1) % 12) + 1), (date_part('minute'::text, date))::integer, (date_part('second'::text, date))::integer])
        Filter: (to_char(date, 'dd.mm.yy hh24.mi.ss'::text) ~ '(^|[^\\d])20\\.12([^\\d]|$)'::text)
        ->  Bitmap Index Scan on ix_dates_parts  (cost=0.00..104.76 rows=1000 width=0)
              Index Cond: ('{20,12}'::integer[] <@ ARRAY[(date_part('year'::text, date))::integer, ((date_part('year'::text, date))::integer % 100), (date_part('month'::text, date))::integer, (date_part('day'::text, date))::integer, (date_part('hour'::text, date))::integer, ((((date_part('hour'::text, date))::integer + 1) % 12) + 1), (date_part('minute'::text, date))::integer, (date_part('second'::text, date))::integer])

This returns the same records but does it 40 times as fast, since the GIN index is used for coarse filtering and the fine-filtering operator is applied to selected results only.

For the index to work, the expression we used to create the index should be provided verbatim to the right side of the <@ (contains) operator.

This solution, however, does not cover all possible conditions: a client can use less obvious formats. However, with a proper design this should not be a problem. The client application should just ignore the parts which are formatted in a way not suitable for the index and do not put them into the array (but of course leave them in the regular expression).

This makes the index less selective but still usable, and the query performance will still be improved greatly.

Written by Quassnoi

February 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

EXPLAIN EXTENDED

Searching for arbitrary portions of a date

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow