EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘PostgreSQL’ Category

PostgreSQL: parametrizing a recursive CTE

with 2 comments

Answering questions asked on the site.

Jan Suchal asks:

We've started playing with PostgreSQL and recursive queries. Looking at example that does basic graph traversal from http://www.postgresql.org/docs/9.0/static/queries-with.html.

We would like to have a view called paths defined like this:

WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph

By calling

SELECT  *
FROM    paths
WHERE   path[1] = :node_id

we would get all paths from a certain node.

The problem here is with performance. When you want this to be quick you need to add a condition for the anchor part of the UNION like this:

WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
WHERE   id = :node_id
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph

Now it's perfectly fast, but we cannot create a view because that would only contain paths from one specific node.

Any ideas?

An often overlooked feature of PostgreSQL is its ability to create set-returning functions and use them in SELECT list.

The record will be cross-joined with the set returned by the function and the result of the join will be added to the resultset.

This is best demonstrated with generate_series, probably a most used PostgreSQL set-returning function.
Read the rest of this entry »

Written by Quassnoi

December 24th, 2010 at 11:00 pm

Posted in PostgreSQL

Date ranges: overlapping with priority

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Jason Foster asks:

We have a table of student registrations:

Students
student_code course_code course_section session_cd
987654321 ESC102H1 Y 20085
998766543 ELEE203H F 20085

course_code and course_section identify a course, session_cd is an academic session, e. g. 20085, 20091, 20079.

The courses (stored in another table) have associated values for engineering design, complementary studies, etc., like that:

Courses
course_code course_section start_session end_session design science studies
ESC102H1 F 20071 20099 10 0 0
AER201Y1 Y 20059 NULL 0 0 30

, or like that:

In-house courses
course_code course_section student_code design science studies
ESC102H1 F 998766543 10 0 0

We are required by an external accreditation body to add up all of the units of engineering design, complementary studies, etc., taken by an individual student.

Where it gets really messy is that we have multiple data feeds for the associated values of courses. For example we have a set from the Registrar's Office, the Civil Department, our In-House version, etc.

The rule is that In-House beats Civil beats the Registrar's Office in the case of any duplication within the overlapping intervals.

The session_cd is of the form YYYY{1,5,9}.

Basically, we have three sets here.

To get the course hours for a given student we should find a record for him in the in-house set, or, failing that, find if the session is within the ranges of one of the external sets (Civil or Registrar). If both ranges contain the academic session the student took the course, Civil should be taken.

The first part is quite simple: we just LEFT JOIN students with the in-house courses and get the hours for the courses which are filled. The real problem is the next part: searching for the ranges containing a given value.

As I already mentioned in the previous posts, relational databases are not in general that efficient for the queries like that. It's easy to use an index to find a value of a column within a given range, but B-Tree indexes are of little help in searching for a range of two columns containing a given value.

However, in this case, the data domain of session_cd is quite a limited set. For each pair of session_start and session_end it is easy to create a set of all possible values between session_start and session_end.

The overlapping parts of the session ranges from the two sets will yields two records for each of the sessions belonging to the range. Of these two records we will need to take the relevant one (that is Civil) by using DISTINCT ON with the additional sorting on the source (Civil goes first).

Then we just join the relevant records to the subset of the students which does not have corresponding records in the in-house version.

Finally, we need to union this with the in-house recordset.
Read the rest of this entry »

Written by Quassnoi

April 7th, 2010 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: using recursive functions in nested sets

with one comment

In the previous article, I discussed a way to improve nested sets model in PostgreSQL.

The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.

This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)).

That solution was pure SQL and it was quite fast, but not optimal.

The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.

We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: nested sets and R-Tree

Comments enabled. I *really* need your comment

A feedback on one of my previous articles comparing adjacency list and nested sets models for PostgreSQL.

Jay writes:

In your series on adjacency lists vs nested sets, you discuss geometric types and R-Tree indexes in MySQL, but you don't discuss them when discussing the same subject with PostgreSQL, which also has geometric types and R-Tree indexing (mostly available through GiST indexes).

To make it simple I added the following line after the data insertion part of the script at Nested Sets - Postgresql:

ALTER TABLE t_hierarchy ADD COLUMN sets POLYGON;
UPDATE t_hierarchy SET sets = POLYGON(BOX(POINT(-1,lft), POINT(1, rgt)));

It needed to be a POLYGON instead of a BOX since there is a @>(POLYGON,POLYGON) function but no @>(BOX,BOX) function, and the polygon was cast from the box to create the rectangle shape required.

It outperforms the adjacency list on all descendants; outperforms it on all ancestors (not by much); performs reasonably well on all descendants up to a certain level on items with few descendants (e. g. 31415) and badly on items with many descendants (e. g. 42).

It still completes in less than 20 seconds though, which is an improvement over 1 minute.

PostgreSQL does support R-Tree indexes indeed (through GiST interface), and they can be used to improve the efficiency of the nested sets model.

Let's create a sample table and try some of the queries that Jay proposed:
Read the rest of this entry »

Written by Quassnoi

March 1st, 2010 at 11:00 pm

Posted in PostgreSQL

Six degrees of separation

with one comment

Answering questions asked on the site.

Kathy asks:

I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.

If your site grows popular, most probably, they are not. But we better check.

On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.

The friendship relationship is best stored in a many-to-many link table with a PRIMARY KEY on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a PRIMARY KEY won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

February 27th, 2010 at 11:00 pm

Posted in PostgreSQL

Sargability of monotonic functions: example

with 2 comments

In my previous article I presented a proposal to add sargability of monotonic functions into the SQL engines.

In a nutshell: a monotonic function is a function that preserves the order of the argument so that it gives the larger results for the larger values of the argument. It is easy to prove that a B-tree with each key replaced by the result of the function will remain the valid B-Tree and hence can be used to search for ranges of function results just like it is used to search for ranges of values.

With a little effort, a B-Tree can also be used to search for the ranges of piecewise monotonic functions: those whose domain can be split into a number of continuous pieces with the function being monotonic within each piece (but it may be not monotonic and even not continuous across the pieces).

In this article, I'll demonstrate the algorithm to do that (implemented in pure SQL on PostgreSQL).

I will show how the performance of simple query

SELECT  *
FROM    t_sine
WHERE   SIN(value) BETWEEN 0.1234 AND 0.1235

could be improved if the sargability of monotonic functions had been implemented in the optimizer.
Read the rest of this entry »

Written by Quassnoi

February 23rd, 2010 at 11:00 pm

Posted in PostgreSQL

Searching for arbitrary portions of a date

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a Ruby on Rails application with a PostgreSQL database; several tables have created_at and updated_at timestamp attributes.

When displayed, those dates are formatted in the user's locale; for example, the timestamp 2009-10-15 16:30:00.435 becomes the string 15.10.2009 - 16:30 (the date format for this example being dd.mm.yyyy - hh.mm).

The requirement is that the user must be able to search for records by date, as if they were strings formatted in the current locale.

For example, searching for 15.10.2009 would return records with dates on October 15th 2009; searching for 15.10 would return records with dates on October 15th of any year, searching for 15 would return all dates that match 15 (be it day, month or year).

The simplest solution would be just retrieve the locale string from the client, format the dates according to that string and search them using LIKE or ~ operators (the latter, as we all know, searches for POSIX regular expressions).

However, this would be not very efficient.

Let's create a sample table and see:
Read the rest of this entry »

Written by Quassnoi

February 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: Selecting records holding group-wise maximum

Comments enabled. I *really* need your comment

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll describe several ways to do this in PostgreSQL 8.4.

PostgreSQL 8.4 syntax is much richer than that of MySQL. The former can use the analytic functions, recursive CTE's and proprietary syntax extensions, all of which can be used for this task.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 26th, 2009 at 11:00 pm

Posted in PostgreSQL

Recursive CTE’s: PostgreSQL

Comments enabled. I *really* need your comment

In the previous article on recursive CTE's in SQL Server I demonstrated that they are not really set-based.

SQL Server implements the recursive CTE's syntax, but forbids all operations that do not distribute over UNION ALL, and each recursive step sees only a single record from the previous step.

Now, let's check the same operations in PostgreSQL 8.4.

To do this, we well write a query that selects only the very first branch of a tree: that is, each item would be the first child of its parent. To do this, we should select the item that would be the first child of the root, the select the first child of that item etc.

This is a set-based operation.

Oracle's CONNECT BY syntax, despite being set-based, offers some limited set-based capabilities: you can use ORDER SIBLINGS BY clause to define the order in which the siblings are returned. However, this would require some additional work to efficiently return only the first branch.

In a true set-based system, this is much more simple.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

Shuffling rows: PostgreSQL

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Josh asks:

I am building a music application and need to create a playlist of arbitrary length from the tracks stored in the database.

This playlist should be shuffled and a track can repeat only after at least 10 other tracks had been played.

Is it possible to do this with a single SQL query or I need to create a cursor?

This is in PostgreSQL 8.4

PostgreSQL 8.4 is a wise choice, since it introduces some new features that ease this task.

To do this we just need to keep a running set that would hold the previous 10 tracks so that we could filter on them.

PostgreSQL 8.4 supports recursive CTE's that allow iterating the resultsets, and arrays that can be easily used to keep the set of 10 latest tracks.

Here's what we should do to build the playlist:

  1. We make a recursive CTE that would generate as many records as we need and just use LIMIT to limit the number
  2. The base part of the CTE is just a random record (fetched with ORDER BY RANDOM() LIMIT 1)
  3. The base part also defines the queue. This is an array which holds 10 latest records selected. It is initialized in the base part with the id of the random track just selected
  4. The recursive part of the CTE joins the previous record with the table, making sure that no record from the latest 10 will be selected on this step. To do this, we just use the array operator <@ (contained by)
  5. The recursive part adds newly selected record to the queue. The queue should be no more than 10 records long, that's why we apply array slicing operator to it ([1:10])

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

October 6th, 2009 at 11:00 pm

Posted in PostgreSQL