Archive for the ‘PostgreSQL’ Category
Adjacency list vs. nested sets: PostgreSQL
This series of articles is inspired by numerous questions asked on the site and on Stack Overflow.
What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?
First, let’s explain what all this means.
Adjacency list
Hierarchical relations (not to be confused with hierarchical data model) are 0-1:0-N transitive relations between entities of same domain.
For instance, ancestor-descendant relation is:
- Transitive:
- If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C
- Irreflexive:
- If A is an ancestor of B, then B is never an ancestor of A
0-1:0-N- A can have zero, one or many children. A can have zero or one parents.
These relations can be represented by an ordered directed tree.
Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.
Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.
This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that’s why this representation is called adjacency list.
Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model
and adjacency list model
have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.
Now, since we have a tree here which implies 0-1:0-N relationship between the vertices, we can define the relation as a self-relation: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a FOREIGN KEY reference to the table itself.
Since multple items can have no parents (and therefore be the roots of their trees), it’s sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.
This is a nice and elegant model, but until recently it had one drawback: it could not be used with SQL.
Read the rest of this entry »
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Let’s see for PostgreSQL 8.4.
To do this we will create two sample tables:
PostgreSQL: selecting items that belong to all categories
From Stack Overflow:
I’ve got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project Category Foo Apple Foo Banana Foo Carrot Bar Apple Bar Carrot Qux Apple Qux Banana What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.
For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.
If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.
A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.
Let’s create a sample table:
Fallback language names: PostgreSQL
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
A quick reminder of the problem taken from Stack Overflow:
I have table
itemand another tablelanguagewhich contains names for theitemsin different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCEon twoSELECTlist subqueries - Use
COALESCEon the results of twoLEFT JOINS - Use the combination of methods above: a
LEFT JOINfor French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.
Now, let’s create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »
PostgreSQL: making duplicate records unique
From Stack Overflow:
In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.
For integration with another system, I need to add a column to the table that makes these duplicate rows unique.
The table is essentially:
txn# detail# amount I could just use an incrementing
seq#for every existing row, however, the application changes will be to take the existingalmost key(basically(txn#, detail#)), and increment the sequence number for this combinationSo, e. g., I’ll have two rows for
(txn#, detail#) = (1, 1), and haveseq#1 for the first and 2 for the second.If
(txn#, detail#) = (513, 44)has 3 duplicate rows, these would haveseq#1, 2 and 3 appropriately.
We just need to update the table with a condition that would distinguish between the rows somehow.
Fortunatly, every PostgreSQL table has an implicit primary key, ctid.
In fact, it’s the pointer to the place in the datafile where the row data resides.
In new PostgreSQL 8.4, we can employ a window function to assign a seq to each row.
This script fills the table with duplicate values (two duplicates for each (txn, detail) pair):
CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT);
CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail);
INSERT
INTO t_dup (txn, detail, seq)
SELECT t, d, NULL
FROM generate_series(1, 100) t,
generate_series(1, 1000) d,
generate_series (1, 2) s;
ANALYZE t_dup;
Here’s the query to update it:
UPDATE ONLY t_dup
SET seq = rn
FROM (
SELECT ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn
FROM t_dup
) d
WHERE t_dup.ctid = d.ctid
This works perfectly and completes in 9 seconds, but, unfortunately, it’s unavailable in older versions of PostgreSQL.
For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*).
Fortunately, ctids are comparable, and we can use less than
(<) operator on them to build an ordered set.
This is less efficient, but still works (if the keys are almost unique
, i. e. there are not many duplicates of each key):
UPDATE ONLY t_dup d SET seq = ( SELECT COUNT(*) FROM t_dup di WHERE di.txn = d.txn AND di.detail = d.detail AND di.ctid <= d.ctid );
This takes a little bit longer (15 seconds), however still works.
PostgreSQL 8.4: efficient MIN(COUNT(*))
Answering questions asked on the site.
Frances asks:
I have a table
user_viewswhich contains pages viewed by users.How do I select the user that had the most and the fewest number of page views?
If you need them at the same time, you’ll just have to group by myuser, then order by COUNT(*).
With new PostgreSQL 8.4, it’s possible to do this in one CTE, to avoid double evaluation:
WITH q AS
(
SELECT myuser, COUNT(*)
FROM user_pages
GROUP BY
myuser
)
SELECT *
FROM (
SELECT 'max'::TEXT AS which, myuser, count
FROM q
ORDER BY
count DESC
LIMIT 1
) qmax
UNION ALL
SELECT *
FROM (
SELECT 'min'::TEXT AS which, myuser, count
FROM q
ORDER BY
count
LIMIT 1
) qmin
However, it’s takes quite a long time if you table is big.
This is probably an answer to the question you asked.
But as a bonus I’d like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.
The main idea here that we should accumulate the least value of COUNT(*) calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.
Let’s create a sample table:
PostgreSQL: last 12 weeks of a year
From Stack Overflow:
I have a query like this:
SELECT EXTRACT(WEEK FROM j.updated_at) as "week", count(j.id) FROM jobs WHERE EXTRACT(YEAR FROM j.updated_at)=2009 GROUP BY EXTRACT(WEEK FROM j.updated_at) ORDER BY week, which works fine, but I only want to show the last 12 weeks.
LIMIT 12works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.
To select last 12 rows in ascending order it’s enough to select first 12 rows in descending order and resort them in a subquery:
SELECT *
FROM (
SELECT EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt
FROM jobs
WHERE EXTRACT(year FROM updated_at) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week
However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.
More than that, EXTRACT(year FROM updated) is not a sargable predicate, and all table rows (or index rows) need to be examined.
On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:
SELECT *
FROM (
SELECT EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt
FROM t_week
WHERE EXTRACT(year FROM updated) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week
| week | cnt | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 3584 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 42 | 3467 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 43 | 3499 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 44 | 3535 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 45 | 3511 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 46 | 3621 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 47 | 3529 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 48 | 3500 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 49 | 3415 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 50 | 3536 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 51 | 3504 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 52 | 3479 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 rows fetched in 0.0004s (3.5843s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sort (cost=58517.44..58517.47 rows=11 width=16)
Sort Key: (date_part('week'::text, t_week.updated))
-> Limit (cost=58517.11..58517.14 rows=11 width=8)
-> Sort (cost=58517.11..58517.14 rows=11 width=8)
Sort Key: (date_part('week'::text, t_week.updated))
-> HashAggregate (cost=58516.75..58516.92 rows=11 width=8)
-> Seq Scan on t_week (cost=0.00..58491.75 rows=5000 width=8)
Filter: (date_part('year'::text, updated) = 2008::double precision)
However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.
Let’s create a sample table and see how to do this:
PostgreSQL: round-robin order
From Stack Overflow:
I have a bunch of URLs stored in a table waiting to be scraped by a script.
However, many of those URLs are from the same site. I would like to return those URLs in a
site-friendlyorder (that is, try to avoid two URLs from the same site in a row) so I won’t be accidentally blocked by making too many HTTP requests in a short time.The database layout is something like this:
create table urls ( site varchar, -- holds e.g. www.example.com or stockoverflow.com url varchar unique );Example result:
SELECT url FROM urls ORDER BY mysterious_round_robin_function(site)
url http://www.example.com/some/file http://stackoverflow.com/questions/ask http://use.perl.org/ http://www.example.com/some/other/file http://stackoverflow.com/tags
To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:
| number | site | url |
|---|---|---|
| 1 | example.com | http://www.example.com/some/file |
| 1 | stackoverflow.com | http://stackoverflow.com/questions/ask |
| 1 | use.perl.org | http://use.perl.org/ |
| 2 | example.com | http://www.example.com/some/other/file |
| 2 | stackoverflow.com | http://stackoverflow.com/tags |
Good news: in new PostgreSQL 8.4, we just can use a window function to do this:
SELECT url
FROM urls
ORDER BY
ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site
| url | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a.example.com/url104 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| b.example.com/url1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| c.example.com/url106 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| y.example.com/url102 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| z.example.com/url103 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| a.example.com/url130 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| b.example.com/url105 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| c.example.com/url132 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| l.example.com/url999 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| m.example.com/url974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1000 rows fetched in 0.0169s (0.0178s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Yes, that simple.
Bad news: in PostgreSQL 8.3 and below, we still need to hack.
Let’s create a sample table and see how:
PostgreSQL 8.4: sampling random rows
On Jul 1, 2009, PostgreSQL 8.4 was released.
In this series of articles, I’d like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.
Other articles on new features of PostgreSQL 8.4:
Today, I’ll show a way to sample random rows from a PRIMARY KEY preserved table.
Usually, if we need, say, 10 random rows from a table, we issue this query:
SELECT *
FROM t_random
ORDER BY
RANDOM()
LIMIT 10
PostgreSQL heavily optimizes this query, since it sees a LIMIT condition and does not sort all rows. Instead, it just keeps a running buffer which contains at most 10 rows with the least values or RANDOM calculated so far, and when a row small enough is met, it sorts only this buffer, not the whole set.
This is quite efficient, but still requires a full table scan.
This can be a problem, since the queries like that are often run frequently on heavily loaded sites (like for showing 10 random pages on social bookmarking systems), and full table scans will hamper performance significantly.
With new PosgreSQL 8.4 abilities to run recursive queries, this can be improved.
We can sample random values of the row ids and use an array to record previously selected values.
Let’s create a sample table and see how can we imrove this query:
Read the rest of this entry »
PostgreSQL 8.4: preserving order for hierarchical query
On Jul 1, 2009, PostgreSQL 8.4 was released.
In this series of articles, I’d like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.
Previously in the series:
Now, let’s see how we can implement the hierarchical queries using the new features of PostgreSQL 8.4.
In PostgreSQL 8.3, we had to create a recursive function to do that. If you are still bound to 8.3 or an earlier version, you can read this article to see how to do it:
In 8.4, we have recursive CTE‘s (common table expressions).
Let’s create a sample hierarchical table and see how can we query it:
Read the rest of this entry »
Subscribe in a reader