EXPLAIN EXTENDED

In the previous article on recursive CTE's in SQL Server I demonstrated that they are not really set-based.

SQL Server implements the recursive CTE's syntax, but forbids all operations that do not distribute over UNION ALL, and each recursive step sees only a single record from the previous step.

Now, let's check the same operations in PostgreSQL 8.4.

To do this, we well write a query that selects only the very first branch of a tree: that is, each item would be the first child of its parent. To do this, we should select the item that would be the first child of the root, the select the first child of that item etc.

This is a set-based operation.

Oracle's CONNECT BY syntax, despite being set-based, offers some limited set-based capabilities: you can use ORDER SIBLINGS BY clause to define the order in which the siblings are returned. However, this would require some additional work to efficiently return only the first branch.

In a true set-based system, this is much more simple.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

Shuffling rows: PostgreSQL

Answering questions asked on the site.

Josh asks:

I am building a music application and need to create a playlist of arbitrary length from the tracks stored in the database.

This playlist should be shuffled and a track can repeat only after at least 10 other tracks had been played.

Is it possible to do this with a single SQL query or I need to create a cursor?

This is in PostgreSQL 8.4

PostgreSQL 8.4 is a wise choice, since it introduces some new features that ease this task.

To do this we just need to keep a running set that would hold the previous 10 tracks so that we could filter on them.

PostgreSQL 8.4 supports recursive CTE's that allow iterating the resultsets, and arrays that can be easily used to keep the set of 10 latest tracks.

Here's what we should do to build the playlist:

We make a recursive CTE that would generate as many records as we need and just use LIMIT to limit the number
The base part of the CTE is just a random record (fetched with ORDER BY RANDOM() LIMIT 1)
The base part also defines the queue. This is an array which holds 10 latest records selected. It is initialized in the base part with the id of the random track just selected
The recursive part of the CTE joins the previous record with the table, making sure that no record from the latest 10 will be selected on this step. To do this, we just use the array operator <@ (contained by)
The recursive part adds newly selected record to the queue. The queue should be no more than 10 records long, that's why we apply array slicing operator to it ([1:10])

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

October 6th, 2009 at 11:00 pm

Posted in PostgreSQL

Adjacency list vs. nested sets: PostgreSQL

with 13 comments

This series of articles is inspired by numerous questions asked on the site and on Stack Overflow.

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

First, let's explain what all this means.

Adjacency list

Hierarchical relations (not to be confused with hierarchical data model) are 0-1:0-N transitive relations between entities of same domain.

For instance, ancestor-descendant relation is:

Transitive:
- If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C
Antisymmetric:
- If A is an ancestor of B, then B is never an ancestor of A
Irreflexive:
- A is never an ancestor to itself
0-1:0-N
- A can have zero, one or many children. A can have zero or one parents.

These relations can be represented by an ordered directed tree.

Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.

Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.

This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that's why this representation is called adjacency list.

Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model and adjacency list model have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.

Now, since we have a tree here which implies 0-1:0-N relationship between the vertices, we can define the relation as a self-relation: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a FOREIGN KEY reference to the table itself.

Since multple items can have no parents (and therefore be the roots of their trees), it's sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.

This is a nice and elegant model, but until recently it had one drawback: it could not be used with SQL.
Read the rest of this entry »

Written by Quassnoi

September 24th, 2009 at 11:00 pm

Posted in PostgreSQL

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL

with 4 comments

Which method is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

Let's see for PostgreSQL 8.4.

To do this we will create two sample tables:

Written by Quassnoi

September 16th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: selecting items that belong to all categories

I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:

Project Category

Foo Apple

Foo Banana

Foo Carrot

Bar Apple

Bar Carrot

Qux Apple

Qux Banana

What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.

For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.

If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.

Project	Category
Foo	Apple
Foo	Banana
Foo	Carrot
Bar	Apple
Bar	Carrot
Qux	Apple
Qux	Banana

A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.

Let's create a sample table:

Written by Quassnoi

August 25th, 2009 at 11:00 pm

Posted in PostgreSQL

Fallback language names: PostgreSQL

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

Use COALESCE on two SELECT list subqueries
Use COALESCE on the results of two LEFT JOINS
Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

Now, let's create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

August 8th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: making duplicate records unique

In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.

For integration with another system, I need to add a column to the table that makes these duplicate rows unique.

The table is essentially:

txn# detail# amount

I could just use an incrementing seq# for every existing row, however, the application changes will be to take the existing almost key (basically (txn#, detail#)), and increment the sequence number for this combination

So, e. g., I'll have two rows for (txn#, detail#) = (1, 1), and have seq# 1 for the first and 2 for the second.

If (txn#, detail#) = (513, 44) has 3 duplicate rows, these would have seq# 1, 2 and 3 appropriately.

We just need to update the table with a condition that would distinguish between the rows somehow.

Fortunatly, every PostgreSQL table has an implicit primary key, ctid.

In fact, it's the pointer to the place in the datafile where the row data resides.

In new PostgreSQL 8.4, we can employ a window function to assign a seq to each row.

This script fills the table with duplicate values (two duplicates for each (txn, detail) pair):

CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT);

CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail);

INSERT
INTO    t_dup (txn, detail, seq)
SELECT  t, d, NULL
FROM    generate_series(1, 100) t,
generate_series(1, 1000) d,
generate_series (1, 2) s;

ANALYZE t_dup;

Here's the query to update it:

UPDATE  ONLY t_dup
SET     seq = rn
FROM    (
SELECT  ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn
FROM    t_dup
) d
WHERE   t_dup.ctid = d.ctid

This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.

For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*).

Fortunately, ctids are comparable, and we can use less than (<) operator on them to build an ordered set.

This is less efficient, but still works (if the keys are almost unique, i. e. there are not many duplicates of each key):

UPDATE	ONLY t_dup d
SET	seq =
(
SELECT	COUNT(*)
FROM	t_dup di
WHERE	di.txn = d.txn
AND di.detail = d.detail
AND di.ctid &lt;= d.ctid
	);

This takes a little bit longer (15 seconds), however still works.

Written by Quassnoi

July 30th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL 8.4: efficient MIN(COUNT(*))

Answering questions asked on the site.

Frances asks:

I have a table user_views which contains pages viewed by users.

How do I select the user that had the most and the fewest number of page views?

If you need them at the same time, you'll just have to group by myuser, then order by COUNT(*).

With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:

WITH    q AS
(
SELECT  myuser, COUNT(*)
FROM    user_pages
GROUP BY
myuser
)
SELECT  *
FROM    (
SELECT  'max'::TEXT AS which, myuser, count
FROM    q
ORDER BY
count DESC
LIMIT 1
) qmax
UNION ALL
SELECT  *
FROM    (
SELECT  'min'::TEXT AS which, myuser, count
FROM    q
ORDER BY
count
LIMIT 1
) qmin

However, it's takes quite a long time if you table is big.

This is probably an answer to the question you asked.

But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.

The main idea here that we should accumulate the least value of COUNT(*) calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.

Let's create a sample table:

Written by Quassnoi

July 29th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: last 12 weeks of a year

I have a query like this:
SELECT  EXTRACT(WEEK FROM j.updated_at) as "week",  count(j.id)
FROM    jobs
WHERE   EXTRACT(YEAR FROM j.updated_at)=2009
GROUP BY
EXTRACT(WEEK FROM j.updated_at)
ORDER BY
week
, which works fine, but I only want to show the last 12 weeks.

LIMIT 12 works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.

To select last 12 rows in ascending order it's enough to select first 12 rows in descending order and resort them in a subquery:

SELECT  *
FROM    (
SELECT  EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt
FROM    jobs
WHERE   EXTRACT(year FROM updated_at) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week

However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.

More than that, EXTRACT(year FROM updated) is not a sargable predicate, and all table rows (or index rows) need to be examined.

On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:

SELECT  *
FROM    (
SELECT  EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt
FROM    t_week
WHERE   EXTRACT(year FROM updated) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week

week	cnt
41	3584
42	3467
43	3499
44	3535
45	3511
46	3621
47	3529
48	3500
49	3415
50	3536
51	3504
52	3479
12 rows fetched in 0.0004s (3.5843s)

Sort  (cost=58517.44..58517.47 rows=11 width=16)
  Sort Key: (date_part('week'::text, t_week.updated))
  ->  Limit  (cost=58517.11..58517.14 rows=11 width=8)
        ->  Sort  (cost=58517.11..58517.14 rows=11 width=8)
              Sort Key: (date_part('week'::text, t_week.updated))
              ->  HashAggregate  (cost=58516.75..58516.92 rows=11 width=8)
                    ->  Seq Scan on t_week  (cost=0.00..58491.75 rows=5000 width=8)
                          Filter: (date_part('year'::text, updated) = 2008::double precision)

However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.

Let's create a sample table and see how to do this:

Written by Quassnoi

July 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: round-robin order

I have a bunch of URLs stored in a table waiting to be scraped by a script.

However, many of those URLs are from the same site. I would like to return those URLs in a site-friendly order (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many HTTP requests in a short time.

The database layout is something like this:
create table urls (
site varchar, -- holds e.g. www.example.com or stockoverflow.com
url varchar unique
);
Example result:
SELECT  url
FROM    urls
ORDER BY
        mysterious_round_robin_function(site)
url

http://www.example.com/some/file

http://stackoverflow.com/questions/ask

http://use.perl.org/

http://www.example.com/some/other/file

http://stackoverflow.com/tags

url
http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags

To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:

number	site	url
1	example.com	http://www.example.com/some/file
1	stackoverflow.com	http://stackoverflow.com/questions/ask
1	use.perl.org	http://use.perl.org/
2	example.com	http://www.example.com/some/other/file
2	stackoverflow.com	http://stackoverflow.com/tags

Good news: in new PostgreSQL 8.4, we just can use a window function to do this:

SELECT  url
FROM    urls
ORDER BY
ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site

url
a.example.com/url104
b.example.com/url1
c.example.com/url106

y.example.com/url102
z.example.com/url103
a.example.com/url130
b.example.com/url105
c.example.com/url132

l.example.com/url999
m.example.com/url974
1000 rows fetched in 0.0169s (0.0178s)

Yes, that simple.

Bad news: in PostgreSQL 8.3 and below, we still need to hack.

Let's create a sample table and see how: