How to create fast database queries

Archive for the ‘PostgreSQL’ Category

Searching for arbitrary portions of a date

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a Ruby on Rails application with a PostgreSQL database; several tables have created_at and updated_at timestamp attributes.

When displayed, those dates are formatted in the user's locale; for example, the timestamp 2009-10-15 16:30:00.435 becomes the string 15.10.2009 - 16:30 (the date format for this example being dd.mm.yyyy - hh.mm).

The requirement is that the user must be able to search for records by date, as if they were strings formatted in the current locale.

For example, searching for 15.10.2009 would return records with dates on October 15th 2009; searching for 15.10 would return records with dates on October 15th of any year, searching for 15 would return all dates that match 15 (be it day, month or year).

The simplest solution would be just retrieve the locale string from the client, format the dates according to that string and search them using LIKE or ~ operators (the latter, as we all know, searches for POSIX regular expressions).

However, this would be not very efficient.

Let's create a sample table and see:
Read the rest of this entry »

Written by Quassnoi

February 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: Selecting records holding group-wise maximum

Comments enabled. I *really* need your comment

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll describe several ways to do this in PostgreSQL 8.4.

PostgreSQL 8.4 syntax is much richer than that of MySQL. The former can use the analytic functions, recursive CTE's and proprietary syntax extensions, all of which can be used for this task.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 26th, 2009 at 11:00 pm

Posted in PostgreSQL

Recursive CTE’s: PostgreSQL

Comments enabled. I *really* need your comment

In the previous article on recursive CTE's in SQL Server I demonstrated that they are not really set-based.

SQL Server implements the recursive CTE's syntax, but forbids all operations that do not distribute over UNION ALL, and each recursive step sees only a single record from the previous step.

Now, let's check the same operations in PostgreSQL 8.4.

To do this, we well write a query that selects only the very first branch of a tree: that is, each item would be the first child of its parent. To do this, we should select the item that would be the first child of the root, the select the first child of that item etc.

This is a set-based operation.

Oracle's CONNECT BY syntax, despite being set-based, offers some limited set-based capabilities: you can use ORDER SIBLINGS BY clause to define the order in which the siblings are returned. However, this would require some additional work to efficiently return only the first branch.

In a true set-based system, this is much more simple.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

Shuffling rows: PostgreSQL

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Josh asks:

I am building a music application and need to create a playlist of arbitrary length from the tracks stored in the database.

This playlist should be shuffled and a track can repeat only after at least 10 other tracks had been played.

Is it possible to do this with a single SQL query or I need to create a cursor?

This is in PostgreSQL 8.4

PostgreSQL 8.4 is a wise choice, since it introduces some new features that ease this task.

To do this we just need to keep a running set that would hold the previous 10 tracks so that we could filter on them.

PostgreSQL 8.4 supports recursive CTE's that allow iterating the resultsets, and arrays that can be easily used to keep the set of 10 latest tracks.

Here's what we should do to build the playlist:

  1. We make a recursive CTE that would generate as many records as we need and just use LIMIT to limit the number
  2. The base part of the CTE is just a random record (fetched with ORDER BY RANDOM() LIMIT 1)
  3. The base part also defines the queue. This is an array which holds 10 latest records selected. It is initialized in the base part with the id of the random track just selected
  4. The recursive part of the CTE joins the previous record with the table, making sure that no record from the latest 10 will be selected on this step. To do this, we just use the array operator <@ (contained by)
  5. The recursive part adds newly selected record to the queue. The queue should be no more than 10 records long, that's why we apply array slicing operator to it ([1:10])

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

October 6th, 2009 at 11:00 pm

Posted in PostgreSQL

Adjacency list vs. nested sets: PostgreSQL

with 8 comments

This series of articles is inspired by numerous questions asked on the site and on Stack Overflow.

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

First, let's explain what all this means.

Adjacency list

Hierarchical relations (not to be confused with hierarchical data model) are 0-1:0-N transitive relations between entities of same domain.

For instance, ancestor-descendant relation is:

  • Transitive:
    • If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C
  • Irreflexive:
    • If A is an ancestor of B, then B is never an ancestor of A
  • 0-1:0-N
    • A can have zero, one or many children. A can have zero or one parents.

These relations can be represented by an ordered directed tree.

Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.

Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.

This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that's why this representation is called adjacency list.

Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model and adjacency list model have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.

Now, since we have a tree here which implies 0-1:0-N relationship between the vertices, we can define the relation as a self-relation: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a FOREIGN KEY reference to the table itself.

Since multple items can have no parents (and therefore be the roots of their trees), it's sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.

This is a nice and elegant model, but until recently it had one drawback: it could not be used with SQL.
Read the rest of this entry »

Written by Quassnoi

September 24th, 2009 at 11:00 pm

Posted in PostgreSQL


with 2 comments

Which method is best to select values present in one table but missing in another one?


FROM    t_left l
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

FROM    t_left l
WHERE   l.value NOT IN
        SELECT  value
        FROM    t_right r

or this:

FROM    t_left l
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value

Let's see for PostgreSQL 8.4.

To do this we will create two sample tables:

Read the rest of this entry »

Written by Quassnoi

September 16th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: selecting items that belong to all categories

Comments enabled. I *really* need your comment

From Stack Overflow:

I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:

Project Category
Foo Apple
Foo Banana
Foo Carrot
Bar Apple
Bar Carrot
Qux Apple
Qux Banana

What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.

For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.

If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.

A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

August 25th, 2009 at 11:00 pm

Posted in PostgreSQL

Fallback language names: PostgreSQL

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

Now, let's create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

August 8th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: making duplicate records unique

Comments enabled. I *really* need your comment

From Stack Overflow:

In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.

For integration with another system, I need to add a column to the table that makes these duplicate rows unique.

The table is essentially:

txn# detail# amount

I could just use an incrementing seq# for every existing row, however, the application changes will be to take the existing almost key (basically (txn#, detail#)), and increment the sequence number for this combination

So, e. g., I'll have two rows for (txn#, detail#) = (1, 1), and have seq# 1 for the first and 2 for the second.

If (txn#, detail#) = (513, 44) has 3 duplicate rows, these would have seq# 1, 2 and 3 appropriately.

We just need to update the table with a condition that would distinguish between the rows somehow.

Fortunatly, every PostgreSQL table has an implicit primary key, ctid.

In fact, it's the pointer to the place in the datafile where the row data resides.

In new PostgreSQL 8.4, we can employ a window function to assign a seq to each row.

This script fills the table with duplicate values (two duplicates for each (txn, detail) pair):

CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT);

CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail);

INTO    t_dup (txn, detail, seq)
FROM    generate_series(1, 100) t,
        generate_series(1, 1000) d,
        generate_series (1, 2) s;

ANALYZE t_dup;

Here's the query to update it:

SET     seq = rn
FROM    (
        SELECT  ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn
        FROM    t_dup
        ) d
WHERE   t_dup.ctid = d.ctid

This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.

For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*).

Fortunately, ctids are comparable, and we can use less than (<) operator on them to build an ordered set.

This is less efficient, but still works (if the keys are almost unique, i. e. there are not many duplicates of each key):

SET	seq =
	FROM	t_dup di
	WHERE	di.txn = d.txn
		AND di.detail = d.detail
		AND di.ctid <= d.ctid

This takes a little bit longer (15 seconds), however still works.

Written by Quassnoi

July 30th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL 8.4: efficient MIN(COUNT(*))

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Frances asks:

I have a table user_views which contains pages viewed by users.

How do I select the user that had the most and the fewest number of page views?

If you need them at the same time, you'll just have to group by myuser, then order by COUNT(*).

With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:

WITH    q AS
        SELECT  myuser, COUNT(*)
        FROM    user_pages
        GROUP BY
FROM    (
        SELECT  'max'::TEXT AS which, myuser, count
        FROM    q
        ORDER BY
                count DESC
        LIMIT 1
        ) qmax
FROM    (
        SELECT  'min'::TEXT AS which, myuser, count
        FROM    q
        ORDER BY
        LIMIT 1
        ) qmin

However, it's takes quite a long time if you table is big.

This is probably an answer to the question you asked.

But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.

The main idea here that we should accumulate the least value of COUNT(*) calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

July 29th, 2009 at 11:00 pm

Posted in PostgreSQL