PostgreSQL: round-robin order at EXPLAIN EXTENDED

I have a bunch of URLs stored in a table waiting to be scraped by a script.

However, many of those URLs are from the same site. I would like to return those URLs in a site-friendly order (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many HTTP requests in a short time.

The database layout is something like this:
create table urls (
site varchar, -- holds e.g. www.example.com or stockoverflow.com
url varchar unique
);
Example result:
SELECT  url
FROM    urls
ORDER BY
        mysterious_round_robin_function(site)
url

http://www.example.com/some/file

http://stackoverflow.com/questions/ask

http://use.perl.org/

http://www.example.com/some/other/file

http://stackoverflow.com/tags

url
http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags

To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:

number	site	url
1	example.com	http://www.example.com/some/file
1	stackoverflow.com	http://stackoverflow.com/questions/ask
1	use.perl.org	http://use.perl.org/
2	example.com	http://www.example.com/some/other/file
2	stackoverflow.com	http://stackoverflow.com/tags

Good news: in new PostgreSQL 8.4, we just can use a window function to do this:

SELECT  url
FROM    urls
ORDER BY
ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site

url
a.example.com/url104
b.example.com/url1
c.example.com/url106

y.example.com/url102
z.example.com/url103
a.example.com/url130
b.example.com/url105
c.example.com/url132

l.example.com/url999
m.example.com/url974
1000 rows fetched in 0.0169s (0.0178s)

Yes, that simple.

Bad news: in PostgreSQL 8.3 and below, we still need to hack.

Let's create a sample table and see how:

CREATE TABLE urls (
id INT NOT NULL PRIMARY KEY,
site VARCHAR(255) NOT NULL,
url VARCHAR(511) NOT NULL
);

CREATE INDEX ix_urls_site_url ON urls (site, url);

INSERT
INTO    urls
SELECT  s,
CHR(97 + s % 26) || '.example.com',
CHR(97 + s % 26) || '.example.com/url' || s
FROM    generate_series (1, 1000) s;

ANALYZE urls;

The table contains 1,000 URLs from 26 sites (from a.example.com to z.example.com).

Since we don't have a ROW_NUMBER in PostgreSQL 8.3, we need to emulate it.

Some time ago I wrote two articles on how to do this:

, but both these solutions use large arrays, and PostgreSQL is not very good in handling them.

Instead, we can use PostgreSQL's nice capability to use fields in LIMIT / OFFSET clauses of a correlated subquery.

We need to do the following:

Select a grouped set of sites along with COUNT(*) for each site
Cross apply generate_series to each site, with a COUNT(*) as a parameter to generate_series. This will give us a rowset similar to original one, but with ordinal numbers instead of URLs
For each row (site, n) from this rowset, select n'th URL for this site. This can be done in a subquery, providing n as a parameter to OFFSET n LIMIT 1
Order first by n, then by site

Here's the query:

SELECT  (
SELECT  url
FROM    urls ui
WHERE   ui.site = uo.site
ORDER BY
url
OFFSET  n
LIMIT   1
) url
FROM    (
SELECT  site, generate_series(0, COUNT(*) - 1) n
FROM    urls
GROUP BY
site
) uo
ORDER BY
n, site

url
a.example.com/url104
b.example.com/url1
c.example.com/url106

y.example.com/url102
z.example.com/url103
a.example.com/url130
b.example.com/url105
c.example.com/url132

l.example.com/url999
m.example.com/url974
1000 rows fetched in 0.0170s (0.0363s)

Sort  (cost=165.19..165.26 rows=26 width=524)
  Sort Key: uo.n, uo.site
  ->  Subquery Scan uo  (cost=24.00..164.58 rows=26 width=524)
        ->  HashAggregate  (cost=24.00..24.46 rows=26 width=14)
              ->  Seq Scan on urls  (cost=0.00..19.00 rows=1000 width=14)
        SubPlan 1
          ->  Limit  (cost=4.30..5.38 rows=1 width=20)
                ->  Index Scan using ix_urls_site_url on urls ui  (cost=0.00..40.88 rows=38 width=20)
                      Index Cond: ((site)::text = ($0)::text)

This query also works fine and produces exactly same results.

Note, however, that the execution plan for this query is a classical example of a Schlemiel the painter's algorithm.

For each n, the query needs to select an OFFSET more and more far away from the beginning, each time unwinding the query and starting from the beginning.

That's why this query efficiency is O(n^2), and it will perform very poorly on large tables.

However, for a small table like ours it is just fine.

Written by Quassnoi

July 21st, 2009 at 11:00 pm

Posted in PostgreSQL

EXPLAIN EXTENDED

PostgreSQL: round-robin order

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow