Archive for July 21st, 2009
From Stack Overflow:
I have a bunch of URLs stored in a table waiting to be scraped by a script.
However, many of those URLs are from the same site. I would like to return those URLs in asite-friendlyorder (that is, try to avoid two URLs from the same site in a row) so I won’t be accidentally blocked by making too many HTTP requests in a short time.
The database layout is something like this:create table urls ( site varchar, -- holds e.g. www.example.com or stockoverflow.com url varchar unique );
Example result:SELECT url FROM urls ORDER BY mysterious_round_robin_function(site)
url http://www.example.com/some/file http://stackoverflow.com/questions/ask http://use.perl.org/ http://www.example.com/some/other/file http://stackoverflow.com/tags
To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:
Good news: in new PostgreSQL 8.4, we just can use a window function to do this:
SELECT url FROM urls ORDER BY ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site
|1000 rows fetched in 0.0169s (0.0178s)|
Yes, that simple.
Bad news: in PostgreSQL 8.3 and below, we still need to hack.
Let’s create a sample table and see how: