Searching for arbitrary portions of a date
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a Ruby on Rails application with a PostgreSQL database; several tables have
created_at
andupdated_at
timestamp attributes.When displayed, those dates are formatted in the user's locale; for example, the timestamp 2009-10-15 16:30:00.435 becomes the string 15.10.2009 - 16:30 (the date format for this example being
dd.mm.yyyy - hh.mm
).The requirement is that the user must be able to search for records by date, as if they were strings formatted in the current locale.
For example, searching for 15.10.2009 would return records with dates on October 15th 2009; searching for 15.10 would return records with dates on October 15th of any year, searching for 15 would return all dates that match 15 (be it day, month or year).
The simplest solution would be just retrieve the locale string from the client, format the dates according to that string and search them using LIKE
or ~
operators (the latter, as we all know, searches for POSIX regular expressions).
However, this would be not very efficient.
Let's create a sample table and see:
Table creation details
This query contains 1,000,000 records with random timestamps spanning more than 114 years.
Assuming that the client's date format is set to dd.mm.yy hh24.mi.ss
, let's try to select the number of records that satisfy this string: '20.12'
. We are also assuming that the beginning of string and the end of string are the field separators as well:
SELECT COUNT(*) FROM t_dates WHERE TO_CHAR(date, 'dd.mm.yy hh24.mi.ss') ~ E'(^|[^\\d])20\\.12([^\\d]|$)'
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4235 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (7.9138s) |
Aggregate (cost=47262.58..47262.59 rows=1 width=0) -> Seq Scan on t_dates (cost=0.00..47258.97 rows=1444 width=0) Filter: (to_char(date, 'dd.mm.yy hh24.mi.ss'::text) ~ '(^|[^\\d])20\\.12([^\\d]|$)'::text)
This query runs for almost 8 seconds. Let's see which values does it return:
SELECT id, date FROM t_dates WHERE TO_CHAR(date, 'dd.mm.yy hh24.mi.ss') ~ E'(^|[^\\d])20\\.12([^\\d]|$)' ORDER BY MD5(id::TEXT) LIMIT 10
id | date |
---|---|
622924 | 1939-01-10 20:12:31 |
781217 | 1920-12-20 07:11:26 |
501772 | 1952-11-05 20:12:03 |
956539 | 1900-12-20 04:57:58 |
523679 | 1950-05-08 01:20:12 |
141308 | 1993-12-20 04:07:37 |
648220 | 1936-02-21 20:12:30 |
236980 | 1983-01-20 20:12:29 |
413051 | 1962-12-20 13:20:54 |
323566 | 1973-03-06 02:20:12 |
We see the matches on day-month, hour-minute and minute-second. There are no month-year or year-hour matches, since there is no 20th month and the year-hour separator is not a period.
The query seems to return correct values but is quite slow.
To improve this query we can use PostgreSQL's GIN
indexing abilities.
A GIN
index is a way to index one record with several keys.
A plain index is a B-Tree structure that stores the pointers to the records (the ctid
's) in the leaf nodes. Such an index only accepts a single expression as a key and builds a single sort order over these expressions, so each record can be pointed to at most once. There is a one-to-many mapping between keys and records: a key can point to many records, but a record can be pointed to by at most one key.
A GIN
index, on the other hand, accepts an array of expressions as a parameter and uses each element of the array as a key. This way, the mapping becomes many-to-many: each key can point to many records, and each record can be pointed to by many keys.
Usually, GIN
indexes are used for FULLTEXT
indexing: the piece of text stored in a record is split into the separate words and each word is indexed separately so that search for any word can be performed using the index.
However, in PostgreSQL, GIN
indexes support integer arrays as well. And we can use this support to improve our query.
As many of you may have noted, I created a GIN
index in the table creation script. Here's how it looks:
CREATE INDEX ix_dates_parts ON t_dates USING GIN(( ARRAY[ DATE_PART('year', date)::INTEGER, DATE_PART('year', date)::INTEGER % 100, DATE_PART('month', date)::INTEGER, DATE_PART('day', date)::INTEGER, DATE_PART('hour', date)::INTEGER, (DATE_PART('hour', date)::INTEGER + 1) % 12 + 1, DATE_PART('minute', date)::INTEGER, DATE_PART('second', date)::INTEGER ] ));
Each record is split into an array of 8 integers, each representing a certain portion of a date which are normally used in the date formatting options. 6 of them just represent date parts, and there are two extra integers that represent a 2-digit year and an AM/PM hour.
This way, any record in year 2010 gets indexed with both 2010 and 10, and any record with hour 19 gets indexed with both 19 and 7.
This covers most formatting options, and can be changed to include less used ones.
To make use of this index we should provide an additional predicate in the WHERE
clause. This predicate will take a user-provided array as an input and search for the records that contain all elements of the user-provided array in the date parts.
Here's how our query looks now:
SELECT COUNT(*)
FROM t_dates
WHERE TO_CHAR(date, 'dd.mm.yy hh24.mi.ss') ~ E'(^|[^\\d])20\\.12([^\\d]|$)'
AND ARRAY[20, 12] <@ ARRAY[
DATE_PART('year', date)::INTEGER,
DATE_PART('year', date)::INTEGER % 100,
DATE_PART('month', date)::INTEGER,
DATE_PART('day', date)::INTEGER,
DATE_PART('hour', date)::INTEGER,
(DATE_PART('hour', date)::INTEGER + 1) % 12 + 1,
DATE_PART('minute', date)::INTEGER,
DATE_PART('second', date)::INTEGER
]
[/sourcecode]
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4235 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.2366s) |
Aggregate (cost=3595.90..3595.91 rows=1 width=0) -> Bitmap Heap Scan on t_dates (cost=104.76..3595.90 rows=1 width=0) Recheck Cond: ('{20,12}'::integer[] <@ ARRAY[(date_part('year'::text, date))::integer, ((date_part('year'::text, date))::integer % 100), (date_part('month'::text, date))::integer, (date_part('day'::text, date))::integer, (date_part('hour'::text, date))::integer, ((((date_part('hour'::text, date))::integer + 1) % 12) + 1), (date_part('minute'::text, date))::integer, (date_part('second'::text, date))::integer]) Filter: (to_char(date, 'dd.mm.yy hh24.mi.ss'::text) ~ '(^|[^\\d])20\\.12([^\\d]|$)'::text) -> Bitmap Index Scan on ix_dates_parts (cost=0.00..104.76 rows=1000 width=0) Index Cond: ('{20,12}'::integer[] <@ ARRAY[(date_part('year'::text, date))::integer, ((date_part('year'::text, date))::integer % 100), (date_part('month'::text, date))::integer, (date_part('day'::text, date))::integer, (date_part('hour'::text, date))::integer, ((((date_part('hour'::text, date))::integer + 1) % 12) + 1), (date_part('minute'::text, date))::integer, (date_part('second'::text, date))::integer])
This returns the same records but does it 40 times as fast, since the GIN
index is used for coarse filtering and the fine-filtering operator is applied to selected results only.
For the index to work, the expression we used to create the index should be provided verbatim to the right side of the <@
(contains) operator.
This solution, however, does not cover all possible conditions: a client can use less obvious formats. However, with a proper design this should not be a problem. The client application should just ignore the parts which are formatted in a way not suitable for the index and do not put them into the array (but of course leave them in the regular expression).
This makes the index less selective but still usable, and the query performance will still be improved greatly.