EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘PostgreSQL’ Category

PostgreSQL: round-robin order

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a bunch of URLs stored in a table waiting to be scraped by a script.

However, many of those URLs are from the same site. I would like to return those URLs in a site-friendly order (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many HTTP requests in a short time.

The database layout is something like this:

create table urls (
site varchar, -- holds e.g. www.example.com or stockoverflow.com
url varchar unique
);

Example result:

SELECT  url
FROM    urls
ORDER BY
        mysterious_round_robin_function(site)
url
http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags

To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:

number site url
1 example.com http://www.example.com/some/file
1 stackoverflow.com http://stackoverflow.com/questions/ask
1 use.perl.org http://use.perl.org/
2 example.com http://www.example.com/some/other/file
2 stackoverflow.com http://stackoverflow.com/tags

Good news: in new PostgreSQL 8.4, we just can use a window function to do this:

SELECT  url
FROM    urls
ORDER BY
ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site
url
a.example.com/url104
b.example.com/url1
c.example.com/url106
y.example.com/url102
z.example.com/url103
a.example.com/url130
b.example.com/url105
c.example.com/url132
l.example.com/url999
m.example.com/url974
1000 rows fetched in 0.0169s (0.0178s)

Yes, that simple.

Bad news: in PostgreSQL 8.3 and below, we still need to hack.

Let's create a sample table and see how:

Read the rest of this entry »

Written by Quassnoi

July 21st, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL 8.4: sampling random rows

with 2 comments

On Jul 1, 2009, PostgreSQL 8.4 was released.

In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.

Other articles on new features of PostgreSQL 8.4:

Today, I'll show a way to sample random rows from a PRIMARY KEY preserved table.

Usually, if we need, say, 10 random rows from a table, we issue this query:

SELECT	*
FROM	t_random
ORDER BY
RANDOM()
LIMIT 10

PostgreSQL heavily optimizes this query, since it sees a LIMIT condition and does not sort all rows. Instead, it just keeps a running buffer which contains at most 10 rows with the least values or RANDOM calculated so far, and when a row small enough is met, it sorts only this buffer, not the whole set.

This is quite efficient, but still requires a full table scan.

This can be a problem, since the queries like that are often run frequently on heavily loaded sites (like for showing 10 random pages on social bookmarking systems), and full table scans will hamper performance significantly.

With new PosgreSQL 8.4 abilities to run recursive queries, this can be improved.

We can sample random values of the row ids and use an array to record previously selected values.

Let's create a sample table and see how can we imrove this query:
Read the rest of this entry »

Written by Quassnoi

July 18th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL 8.4: preserving order for hierarchical query

with 3 comments

On Jul 1, 2009, PostgreSQL 8.4 was released.

In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.

Previously in the series:

Now, let's see how we can implement the hierarchical queries using the new features of PostgreSQL 8.4.

In PostgreSQL 8.3, we had to create a recursive function to do that. If you are still bound to 8.3 or an earlier version, you can read this article to see how to do it:

In 8.4, we have recursive CTE's (common table expressions).

Let's create a sample hierarchical table and see how can we query it:
Read the rest of this entry »

Written by Quassnoi

July 17th, 2009 at 11:00 pm

Posted in PostgreSQL

Flattening timespans: PostgreSQL 8.4

with one comment

On Jul 1, 2009, PostgreSQL 8.4 was released.

Among other imporvements, this version supports window functions, recursive queries and common table expressions (CTE's).

Despite being a minor release (accoring to the version numeration), this version can become quite a milestone, since these features make developer's life much, much easier.

Let's check how efficienly these features are implemented.

To do this, I'll take some tasks that I wrote about in the previous blog posts and try to reimplement them using new PostgreSQL features.

I'll start with quite a common task of flattening the intersecting timespans which I wrote about in this article:

This task requires calculating a running maximum and taking a previous record from a recordset, and therefore is a good illustration for window functions.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 14th, 2009 at 11:00 pm

Posted in PostgreSQL

Overlapping ranges: PostgreSQL

Comments enabled. I *really* need your comment

Continuing the theme on overlapping ranges:

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Finally, PostgreSQL.

A quick reminder: we have two options to write a condition for such a query.

A more simple one:

start_date < @week_end AND end_date > @week_start

is, uh, more simple, but less efficient in SQL Server and MySQL.

A more complex one:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

is more index friendly and is more performant in SQL Server, Oracle and MySQL (with a slight schema change to enable SPATIAL index support)

Now, let's create a sample table in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

July 3rd, 2009 at 11:00 pm

Posted in PostgreSQL

Flattening timespans: PostgreSQL

Comments enabled. I *really* need your comment

In previous three articles:

, I described various approaches to merging a set of intersecting timespans into a smaller set of larger, non-intersecting ranges.

Today, I'll describe how to do it in PostgreSQL.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

PostgreSQL implements various procedural languages, like pl/PGSQL, pl/perl etc., which can be used to solve almost any task using procedural approaches.

Using, for instance, pl/PGSQL, we can implement an approach similar to one we used for SQL Server.

We should create a set-returning function, open a cursor inside it and select all timespans ordered by start, keeping track of a running maximum of stop.

Whenever we find a timespan that starts later than end all previously started timespans, we know we have a gap.

Here's the sample table:
Read the rest of this entry »

Written by Quassnoi

June 14th, 2009 at 11:00 pm

Posted in PostgreSQL

Longest common prefix: PostgreSQL

Comments enabled. I *really* need your comment

This is a series of articles on how to strip all strings in a set of their longest common prefix and concatenate the results:

Today, I'll show how to do it in PostgreSQL.

A quick reminder of the problem (taken from Stack Overflow):

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

Like with SQL Server, it is possible to do this in PostgreSQL using a single SQL query.

But since PostgreSQL offers a nice ability to create custom aggregates, I'll better demonstrate how to solve this task using ones.

In my opinion, custom aggregates fit here just perfectly.

Since PostgreSQL lacks native support for aggregate concatenation, we will need to create two custom aggregates here:
Read the rest of this entry »

Written by Quassnoi

June 5th, 2009 at 11:00 pm

Posted in PostgreSQL

Aggregate AND

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table with a foreign key and a boolean value (and a bunch of other columns that aren't relevant here), as such:

CREATE TABLE myTable
(        someKey integer,
        someBool boolean
);
INSERT
INTO    myTable
VALUES
        (1, 't'),
        (1, 't'),
        (2, 'f'),
        (2, 't');

Each someKey could have 0 or more entries.

For any given someKey, I need to know if

  1. All the entries are true, or
  2. Any of the entries are false

Basically, it's an AND.

This solution is often used to represent polls that should be unanimous for the decision to be made (i. e. anyone can put a veto on the decision).

PostgreSQL offers a special aggregate BOOL_AND to do this.

However, an aggregate may be less efficient here.

The return value of an AND function is constrained by finding certain values:

  1. Whenever a FALSE is found, the return value cannot be TRUE anymore. It's either FALSE or NULL.
  2. Whenever a NULL is found, the return value is NULL

Aggregates in PostgreSQL, however, won't take this into account.

What we need here is a method to stop and return whenever first NULL or FALSE value is found.

Let's create a sample table and see how can it may be done:
Read the rest of this entry »

Written by Quassnoi

May 30th, 2009 at 11:00 pm

Posted in PostgreSQL

Hierarchical queries in PostgreSQL

Comments enabled. I *really* need your comment

Note: this article concerns PostgreSQL 8.3 and earlier.

For hierarchical queries in PostgreSQL 8.4 and higher, see this article:

In one of the previous articles I wrote about using hierarchical queries in MySQL:

PostgreSQL has a contrib module to implement the same functionality.

However, it's not always possible to install and use contribs. Same is true for procedural languages.

Fortunately, this functionality can be implemented using a plain SQL function.

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

May 29th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: selecting a function

Comments enabled. I *really* need your comment

From Stack Overflow:

Hello,

I want to write a SELECT statement as follows:

SELECT  field_a
FROM    my_table
WHERE   field_b IN (my_function(field_c))

Is that possible?

Would my_function have to return an array?

It is of course possible in PostgreSQL using a set returning function.

Return type of such a function should be declared as SETOF, so the function will return a rowset of given type.

There are at least two ways to call this function, and in this article I will consider the benefits and drawbacks of each method.

Let's create sample tables:
Read the rest of this entry »

Written by Quassnoi

May 14th, 2009 at 11:00 pm

Posted in PostgreSQL