Archive for the ‘PostgreSQL’ Category
Flattening timespans: PostgreSQL 8.4
On Jul 1, 2009, PostgreSQL 8.4 was released.
Among other imporvements, this version supports window functions, recursive queries and common table expressions (CTE‘s).
Despite being a minor release (accoring to the version numeration), this version can become quite a milestone, since these features make developer’s life much, much easier.
Let’s check how efficienly these features are implemented.
To do this, I’ll take some tasks that I wrote about in the previous blog posts and try to reimplement them using new PostgreSQL features.
I’ll start with quite a common task of flattening the intersecting timespans which I wrote about in this article:
This task requires calculating a running maximum and taking a previous record from a recordset, and therefore is a good illustration for window functions.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there’s a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they’re overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
Let’s create a sample table:
Read the rest of this entry »
Overlapping ranges: PostgreSQL
Continuing the theme on overlapping ranges:
From Stack Overflow:
I have an event table that specifies a date range with
start_dateandend_datefields.I have another date range, specified in code, that defines the current week as
week_startandweek_end.I’d like to query all events for the week.
Finally, PostgreSQL.
A quick reminder: we have two options to write a condition for such a query.
A more simple one:
start_date < @week_end AND end_date > @week_start
is, uh, more simple, but less efficient in SQL Server and MySQL.
A more complex one:
(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
is more index friendly and is more performant in SQL Server, Oracle and MySQL (with a slight schema change to enable SPATIAL index support)
Now, let's create a sample table in PostgreSQL:
Read the rest of this entry »
Flattening timespans: PostgreSQL
In previous three articles:
, I described various approaches to merging a set of intersecting timespans into a smaller set of larger, non-intersecting ranges.
Today, I’ll describe how to do it in PostgreSQL.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there’s a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they’re overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
PostgreSQL implements various procedural languages, like pl/PGSQL, pl/perl etc., which can be used to solve almost any task using procedural approaches.
Using, for instance, pl/PGSQL, we can implement an approach similar to one we used for SQL Server.
We should create a set-returning function, open a cursor inside it and select all timespans ordered by start, keeping track of a running maximum of stop.
Whenever we find a timespan that starts later than end all previously started timespans, we know we have a gap.
Here’s the sample table:
Read the rest of this entry »
Longest common prefix: PostgreSQL
This is a series of articles on how to strip all strings in a set of their longest common prefix and concatenate the results:
- Longest common prefix: SQL Server
- Longest common prefix: PostgreSQL
- Longest common prefix: Oracle
- Longest common prefix: MySQL
Today, I’ll show how to do it in PostgreSQL.
A quick reminder of the problem (taken from Stack Overflow):
I have some data:
id ref 1 3536757616 1 3536757617 1 3536757618 2 3536757628 2 3536757629 2 3536757630 and want to get the result like this:
id result 1 3536757616/7/8 2 3536757629/28/30 Essentially, the data should be aggregated on
id, and theref‘s should be concatenated together and separated by a/(slash), but with longest common prefix removed.
Like with SQL Server, it is possible to do this in PostgreSQL using a single SQL query.
But since PostgreSQL offers a nice ability to create custom aggregates, I’ll better demonstrate how to solve this task using ones.
In my opinion, custom aggregates fit here just perfectly.
Since PostgreSQL lacks native support for aggregate concatenation, we will need to create two custom aggregates here:
Read the rest of this entry »
Aggregate AND
From Stack Overflow:
I have a table with a foreign key and a boolean value (and a bunch of other columns that aren’t relevant here), as such:
CREATE TABLE myTable ( someKey integer, someBool boolean ); INSERT INTO myTable VALUES (1, 't'), (1, 't'), (2, 'f'), (2, 't');Each
someKeycould have 0 or more entries.For any given
someKey, I need to know if
- All the entries are true, or
- Any of the entries are false
Basically, it’s an
AND.
This solution is often used to represent polls that should be unanimous for the decision to be made (i. e. anyone can put a veto on the decision).
PostgreSQL offers a special aggregate BOOL_AND to do this.
However, an aggregate may be less efficient here.
The return value of an AND function is constrained by finding certain values:
- Whenever a
FALSEis found, the return value cannot beTRUEanymore. It’s eitherFALSEorNULL. - Whenever a
NULLis found, the return value isNULL
Aggregates in PostgreSQL, however, won’t take this into account.
What we need here is a method to stop and return whenever first NULL or FALSE value is found.
Let’s create a sample table and see how can it may be done:
Read the rest of this entry »
Hierarchical queries in PostgreSQL
Note: this article concerns PostgreSQL 8.3 and earlier.
For hierarchical queries in PostgreSQL 8.4 and higher, see this article:
In one of the previous articles I wrote about using hierarchical queries in MySQL:
PostgreSQL has a contrib module to implement the same functionality.
However, it’s not always possible to install and use contribs. Same is true for procedural languages.
Fortunately, this functionality can be implemented using a plain SQL function.
Let’s create a sample table and see how it works:
Read the rest of this entry »
PostgreSQL: selecting a function
From Stack Overflow:
Hello,
I want to write a
SELECTstatement as follows:SELECT field_a FROM my_table WHERE field_b IN (my_function(field_c))Is that possible?
Would
my_functionhave to return an array?
It is of course possible in PostgreSQL using a set returning function.
Return type of such a function should be declared as SETOF, so the function will return a rowset of given type.
There are at least two ways to call this function, and in this article I will consider the benefits and drawbacks of each method.
Let’s create sample tables:
Read the rest of this entry »
PostgreSQL: emulating ROW_NUMBER
Note: this article concerns PostgreSQL 8.3 and below.
PostgreSQL 8.4 introduces window functions.
Window function ROW_NUMBER() implements the functionality in question more efficiently.
In one of the previous articles:
, I described emulating Oracle‘s pseudocolumn ROWNUM in PostgreSQL.
Now, we’ll extend this query to emulate ROW_NUMBER.
A quick reminder: ROW_NUMBER is an analytical function in ANSI SQL 2003 supported by Oracle and MS SQL Server.
It enumerates each row in a resultset, but, unlike ROWNUM, may take two additional parameters: PARTITION BY and ORDER BY.
PARTITION BY splits a rowset into several partitions, each of them being numbered with its own sequence starting from 1.
ORDER BY defines the order the rows are numbered within each partition. This order may differ from the order the rows are returned in.
This function helps building queries which allow to select N rows for each partition.
Let’s create a sample table and see how we do it in PostgreSQL:
Read the rest of this entry »
PostgreSQL: row numbers
Note: this article concerns PostgreSQL 8.3 and below.
PostgreSQL 8.4 introduces window functions.
Window function ROW_NUMBER() implements the functionality in question more efficiently.
ROWNUM is a very useful pseudocolumn in Oracle that returns the position of each row in a final dataset.
Upcoming PostgreSQL 8.4 will have this pseudocolumn, but as for now will we need a hack to access it.
The main idea is simple:
- Wrap the query results into an array
- Join this array with a
generate_series()so that numbers from 1 toarray_upper()are returned - For each row returned, return this number (as
ROWNUM) along the corresponding array member (which is the row from the original query)
Let’s create a table with multiple columns of different datatypes, write a complex query and try to assign the ROWNUM to the query results:
Read the rest of this entry »
PostgreSQL: optimizing DISTINCT
In PostgreSQL (as of 8.3, at least), performance of DISTINCT clause in SELECT list is quite poor.
Probably because DISTINCT code in PostgreSQL is very, very old, it always acts in same dumb way: sorts the resultset and filters out the duplicate records.
GROUP BY that can be used for the same purpose is more smart, as it employs more efficient HashAggregate, but its performance is still poor for large dataset.
All major RDBMS‘s, including MySQL, are able to jump over index keys to select DISTINCT values from an indexed table. This is extremely fast if there are lots of records in a table but not so many DISTINCT values.
This behavior can be emulated in PostgreSQL too.
Let’s create a sample table:
Read the rest of this entry »
Subscribe in a reader