Archive for the ‘PostgreSQL’ Category
PostgreSQL: Selecting records holding group-wise maximum
Comments enabled. I *really* need your comment
Continuing the series on selecting records holding group-wise maximums:
How do I select the whole records, grouped on
grouper
and holding a group-wise maximum (or minimum) on other column?
In this article, I'll describe several ways to do this in PostgreSQL 8.4.
PostgreSQL 8.4 syntax is much richer than that of MySQL. The former can use the analytic functions, recursive CTE's and proprietary syntax extensions, all of which can be used for this task.
Let's create a sample table:
Read the rest of this entry »
Recursive CTE’s: PostgreSQL
Comments enabled. I *really* need your comment
In the previous article on recursive CTE's in SQL Server I demonstrated that they are not really set-based.
SQL Server implements the recursive CTE's syntax, but forbids all operations that do not distribute over UNION ALL
, and each recursive step sees only a single record from the previous step.
Now, let's check the same operations in PostgreSQL 8.4.
To do this, we well write a query that selects only the very first branch of a tree: that is, each item would be the first child of its parent. To do this, we should select the item that would be the first child of the root, the select the first child of that item etc.
This is a set-based operation.
Oracle's CONNECT BY
syntax, despite being set-based, offers some limited set-based capabilities: you can use ORDER SIBLINGS BY
clause to define the order in which the siblings are returned. However, this would require some additional work to efficiently return only the first branch.
In a true set-based system, this is much more simple.
Let's create a sample table:
Read the rest of this entry »
Shuffling rows: PostgreSQL
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Josh asks:
I am building a music application and need to create a playlist of arbitrary length from the tracks stored in the database.
This playlist should be shuffled and a track can repeat only after at least 10 other tracks had been played.
Is it possible to do this with a single SQL query or I need to create a cursor?
This is in PostgreSQL 8.4
PostgreSQL 8.4 is a wise choice, since it introduces some new features that ease this task.
To do this we just need to keep a running set that would hold the previous 10 tracks so that we could filter on them.
PostgreSQL 8.4 supports recursive CTE's that allow iterating the resultsets, and arrays that can be easily used to keep the set of 10 latest tracks.
Here's what we should do to build the playlist:
- We make a recursive CTE that would generate as many records as we need and just use LIMIT to limit the number
- The base part of the CTE is just a random record (fetched with
ORDER BY RANDOM() LIMIT 1
) - The base part also defines the queue. This is an array which holds 10 latest records selected. It is initialized in the base part with the
id
of the random track just selected - The recursive part of the CTE joins the previous record with the table, making sure that no record from the latest 10 will be selected on this step. To do this, we just use the array operator
<@
(contained by) - The recursive part adds newly selected record to the queue. The queue should be no more than 10 records long, that's why we apply array slicing operator to it (
[1:10]
)
Let's create a sample table:
Read the rest of this entry »
Adjacency list vs. nested sets: PostgreSQL
This series of articles is inspired by numerous questions asked on the site and on Stack Overflow.
What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?
First, let's explain what all this means.
Adjacency list
Hierarchical relations (not to be confused with hierarchical data model) are 0-1:0-N
transitive relations between entities of same domain.
For instance, ancestor-descendant relation is:
- Transitive:
- If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C
- Antisymmetric:
- If A is an ancestor of B, then B is never an ancestor of A
- Irreflexive:
- A is never an ancestor to itself
0-1:0-N
- A can have zero, one or many children. A can have zero or one parents.
These relations can be represented by an ordered directed tree.
Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.
Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.
This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that's why this representation is called adjacency list.
Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model
and adjacency list model
have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.
Now, since we have a tree here which implies 0-1:0-N
relationship between the vertices, we can define the relation as a self-relation: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a FOREIGN KEY
reference to the table itself.
Since multple items can have no parents (and therefore be the roots of their trees), it's sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.
This is a nice and elegant model, but until recently it had one drawback: it could not be used with SQL.
Read the rest of this entry »
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Let's see for PostgreSQL 8.4.
To do this we will create two sample tables:
PostgreSQL: selecting items that belong to all categories
Comments enabled. I *really* need your comment
From Stack Overflow:
I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project Category Foo Apple Foo Banana Foo Carrot Bar Apple Bar Carrot Qux Apple Qux Banana What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.
For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.
If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.
A usual approach to this problem involves GROUP BY
and COUNT
. However, this can be made more efficient.
Let's create a sample table:
Fallback language names: PostgreSQL
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
A quick reminder of the problem taken from Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOIN
s) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN
for the local language and a subquery for the fallback one.
Now, let's create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »
PostgreSQL: making duplicate records unique
Comments enabled. I *really* need your comment
From Stack Overflow:
In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.
For integration with another system, I need to add a column to the table that makes these duplicate rows unique.
The table is essentially:
txn# detail# amount I could just use an incrementing
seq#
for every existing row, however, the application changes will be to take the existingalmost key(basically(txn#, detail#)
), and increment the sequence number for this combinationSo, e. g., I'll have two rows for
(txn#, detail#) = (1, 1)
, and haveseq#
1 for the first and 2 for the second.If
(txn#, detail#) = (513, 44)
has 3 duplicate rows, these would haveseq#
1, 2 and 3 appropriately.
We just need to update the table with a condition that would distinguish between the rows somehow.
Fortunatly, every PostgreSQL table has an implicit primary key, ctid
.
In fact, it's the pointer to the place in the datafile where the row data resides.
In new PostgreSQL 8.4, we can employ a window function to assign a seq
to each row.
This script fills the table with duplicate values (two duplicates for each (txn, detail)
pair):
CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT); CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail); INSERT INTO t_dup (txn, detail, seq) SELECT t, d, NULL FROM generate_series(1, 100) t, generate_series(1, 1000) d, generate_series (1, 2) s; ANALYZE t_dup;
Here's the query to update it:
UPDATE ONLY t_dup SET seq = rn FROM ( SELECT ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn FROM t_dup ) d WHERE t_dup.ctid = d.ctid
This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.
For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*)
.
Fortunately, ctid
s are comparable, and we can use less than
(<
) operator on them to build an ordered set.
This is less efficient, but still works (if the keys are almost unique
, i. e. there are not many duplicates of each key):
UPDATE ONLY t_dup d SET seq = ( SELECT COUNT(*) FROM t_dup di WHERE di.txn = d.txn AND di.detail = d.detail AND di.ctid <= d.ctid );
This takes a little bit longer (15 seconds), however still works.
PostgreSQL 8.4: efficient MIN(COUNT(*))
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Frances asks:
I have a table
user_views
which contains pages viewed by users.How do I select the user that had the most and the fewest number of page views?
If you need them at the same time, you'll just have to group by myuser
, then order by COUNT(*)
.
With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:
WITH q AS ( SELECT myuser, COUNT(*) FROM user_pages GROUP BY myuser ) SELECT * FROM ( SELECT 'max'::TEXT AS which, myuser, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, myuser, count FROM q ORDER BY count LIMIT 1 ) qmin
However, it's takes quite a long time if you table is big.
This is probably an answer to the question you asked.
But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.
The main idea here that we should accumulate the least value of COUNT(*)
calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.
Let's create a sample table:
PostgreSQL: last 12 weeks of a year
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a query like this:
SELECT EXTRACT(WEEK FROM j.updated_at) as "week", count(j.id) FROM jobs WHERE EXTRACT(YEAR FROM j.updated_at)=2009 GROUP BY EXTRACT(WEEK FROM j.updated_at) ORDER BY week, which works fine, but I only want to show the last 12 weeks.
LIMIT 12
works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.
To select last 12 rows in ascending order it's enough to select first 12 rows in descending order and resort them in a subquery:
SELECT * FROM ( SELECT EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt FROM jobs WHERE EXTRACT(year FROM updated_at) = 2008 GROUP BY week ORDER BY week DESC LIMIT 12 ) q ORDER BY week
However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.
More than that, EXTRACT(year FROM updated)
is not a sargable predicate, and all table rows (or index rows) need to be examined.
On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:
SELECT * FROM ( SELECT EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt FROM t_week WHERE EXTRACT(year FROM updated) = 2008 GROUP BY week ORDER BY week DESC LIMIT 12 ) q ORDER BY week
week | cnt | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 3584 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42 | 3467 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
43 | 3499 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
44 | 3535 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
45 | 3511 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
46 | 3621 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
47 | 3529 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
48 | 3500 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49 | 3415 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 | 3536 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
51 | 3504 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
52 | 3479 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12 rows fetched in 0.0004s (3.5843s) |
Sort (cost=58517.44..58517.47 rows=11 width=16) Sort Key: (date_part('week'::text, t_week.updated)) -> Limit (cost=58517.11..58517.14 rows=11 width=8) -> Sort (cost=58517.11..58517.14 rows=11 width=8) Sort Key: (date_part('week'::text, t_week.updated)) -> HashAggregate (cost=58516.75..58516.92 rows=11 width=8) -> Seq Scan on t_week (cost=0.00..58491.75 rows=5000 width=8) Filter: (date_part('year'::text, updated) = 2008::double precision)
However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.
Let's create a sample table and see how to do this: