EXPLAIN EXTENDED

Written by Quassnoi

July 24th, 2009 at 11:00 pm

Posted in MySQL

PostgreSQL: last 12 weeks of a year

I have a query like this:
SELECT  EXTRACT(WEEK FROM j.updated_at) as "week",  count(j.id)
FROM    jobs
WHERE   EXTRACT(YEAR FROM j.updated_at)=2009
GROUP BY
EXTRACT(WEEK FROM j.updated_at)
ORDER BY
week
, which works fine, but I only want to show the last 12 weeks.

LIMIT 12 works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.

To select last 12 rows in ascending order it's enough to select first 12 rows in descending order and resort them in a subquery:

SELECT  *
FROM    (
SELECT  EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt
FROM    jobs
WHERE   EXTRACT(year FROM updated_at) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week

However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.

More than that, EXTRACT(year FROM updated) is not a sargable predicate, and all table rows (or index rows) need to be examined.

On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:

SELECT  *
FROM    (
SELECT  EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt
FROM    t_week
WHERE   EXTRACT(year FROM updated) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week

week	cnt
41	3584
42	3467
43	3499
44	3535
45	3511
46	3621
47	3529
48	3500
49	3415
50	3536
51	3504
52	3479
12 rows fetched in 0.0004s (3.5843s)

Sort  (cost=58517.44..58517.47 rows=11 width=16)
  Sort Key: (date_part('week'::text, t_week.updated))
  ->  Limit  (cost=58517.11..58517.14 rows=11 width=8)
        ->  Sort  (cost=58517.11..58517.14 rows=11 width=8)
              Sort Key: (date_part('week'::text, t_week.updated))
              ->  HashAggregate  (cost=58516.75..58516.92 rows=11 width=8)
                    ->  Seq Scan on t_week  (cost=0.00..58491.75 rows=5000 width=8)
                          Filter: (date_part('year'::text, updated) = 2008::double precision)

However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.

Let's create a sample table and see how to do this:

Written by Quassnoi

July 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

Hierarchial queries in MySQL: identifying trees

Continuing on the theme of hierarchical queries in MySQL:

Assume we have a table with hierarchical structure like this:

treeitem
1
3
9
10
11
4
12
13
14
5
15
16
17
2
6
18
7
8
18 rows fetched in 0.0003s (0.0137s)

We have two trees here: one starting from 1, another one starting from 2.

The problem is: given any item, we should identify the whole tree this item belongs to, and return the whole tree in the hierarchical order.

This also can be easily done using hierarchical queries in MySQL.

In this article: Hierarchical queries in MySQL I shown how to implement a function that returnes tree items in correct order, being called sequentially.

This function is reentrable and keeps its state in session variables, one of which, @start_with, defines the parent element for the tree we want to build.

We have two problems here:

Given an item, define a root of the tree it belongs to
Build a whole tree, starting from the root

The first problem can be solved by iterating the linked list backwards, starting from the variable given as an input.

This article:

Sorting lists

describes how to do it in great detail, that's why I'll just put a query here.

Let's create the table described above:

Written by Quassnoi

July 22nd, 2009 at 11:00 pm

Posted in MySQL

PostgreSQL: round-robin order

I have a bunch of URLs stored in a table waiting to be scraped by a script.

However, many of those URLs are from the same site. I would like to return those URLs in a site-friendly order (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many HTTP requests in a short time.

The database layout is something like this:
create table urls (
site varchar, -- holds e.g. www.example.com or stockoverflow.com
url varchar unique
);
Example result:
SELECT  url
FROM    urls
ORDER BY
        mysterious_round_robin_function(site)
url

http://www.example.com/some/file

http://stackoverflow.com/questions/ask

http://use.perl.org/

http://www.example.com/some/other/file

http://stackoverflow.com/tags

url
http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags

To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:

number	site	url
1	example.com	http://www.example.com/some/file
1	stackoverflow.com	http://stackoverflow.com/questions/ask
1	use.perl.org	http://use.perl.org/
2	example.com	http://www.example.com/some/other/file
2	stackoverflow.com	http://stackoverflow.com/tags

Good news: in new PostgreSQL 8.4, we just can use a window function to do this:

SELECT  url
FROM    urls
ORDER BY
ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site

url
a.example.com/url104
b.example.com/url1
c.example.com/url106

y.example.com/url102
z.example.com/url103
a.example.com/url130
b.example.com/url105
c.example.com/url132

l.example.com/url999
m.example.com/url974
1000 rows fetched in 0.0169s (0.0178s)

Yes, that simple.

Bad news: in PostgreSQL 8.3 and below, we still need to hack.

Let's create a sample table and see how:

Written by Quassnoi

July 21st, 2009 at 11:00 pm

Posted in PostgreSQL

Hierarchical data in MySQL: parents and children in one query

with 23 comments

Answering questions asked on the site.

Michael asks:

I was wondering how to implement a hierarchical query in MySQL (using the ancestry chains version) for a single row, such that it picks up the parents (if any) and any children (if any).

The idea is, I want to be able to jump in at any point, provide an Id of some sort, and be able to draw out the entire hierarchy for that Id, both upwards and downwards.

We need to combine two queries here:

Original hierarchical query that returns all descendants of a given id (a descendancy chain)
A query that would return all ancestors of a given id (an ancestry chain)

An id can have only one parent, that's why we can employ a linked list technique to build an ancestry chain, like shown in this article:

Sorting lists

Here's the query to to this (no functions required):

SELECT  @r AS _id,
         (
         SELECT  @r := parent
         FROM    t_hierarchy
         WHERE   id = _id
         ) AS parent,
         @l := @l + 1 AS lvl
 FROM    (
         SELECT  @r := 1218,
                 @l := 0,
                 @cl := 0
         ) vars,
         t_hierarchy h
WHERE    @r <> 0

To combine two queries, we can employ a simple UNION ALL.

The only problem that is left to preserve the correct level, since the ancestry chain query conts level backwards, and the hierarchical query will count it starting from zero.

Let's create a sample table and see what we get:
Read the rest of this entry »

Written by Quassnoi

July 20th, 2009 at 11:00 pm

Posted in MySQL

PostgreSQL 8.4: sampling random rows

with 2 comments

On Jul 1, 2009, PostgreSQL 8.4 was released.

In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.

Other articles on new features of PostgreSQL 8.4:

Today, I'll show a way to sample random rows from a PRIMARY KEY preserved table.

Usually, if we need, say, 10 random rows from a table, we issue this query:

SELECT	*
FROM	t_random
ORDER BY
RANDOM()
LIMIT 10

PostgreSQL heavily optimizes this query, since it sees a LIMIT condition and does not sort all rows. Instead, it just keeps a running buffer which contains at most 10 rows with the least values or RANDOM calculated so far, and when a row small enough is met, it sorts only this buffer, not the whole set.

This is quite efficient, but still requires a full table scan.

This can be a problem, since the queries like that are often run frequently on heavily loaded sites (like for showing 10 random pages on social bookmarking systems), and full table scans will hamper performance significantly.

With new PosgreSQL 8.4 abilities to run recursive queries, this can be improved.

We can sample random values of the row ids and use an array to record previously selected values.

Let's create a sample table and see how can we imrove this query:
Read the rest of this entry »

Written by Quassnoi

July 18th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL 8.4: preserving order for hierarchical query

with 3 comments

On Jul 1, 2009, PostgreSQL 8.4 was released.

In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.

Previously in the series:

Flattening timespans: PostgreSQL 8.4

Now, let's see how we can implement the hierarchical queries using the new features of PostgreSQL 8.4.

In PostgreSQL 8.3, we had to create a recursive function to do that. If you are still bound to 8.3 or an earlier version, you can read this article to see how to do it:

Hierarchical queries in PostgreSQL

In 8.4, we have recursive CTE's (common table expressions).

Let's create a sample hierarchical table and see how can we query it:
Read the rest of this entry »

Written by Quassnoi

July 17th, 2009 at 11:00 pm

Posted in PostgreSQL

INNER JOIN vs. CROSS APPLY

with 48 comments

Can anyone give me a good example of when CROSS APPLY makes a difference in those cases where INNER JOIN will work as well?

This is of course SQL Server.

A quick reminder on the terms.

INNER JOIN is the most used construct in SQL: it joins two tables together, selecting only those row combinations for which a JOIN condition is true.

This query:

SELECT  *
FROM    table1
JOIN    table2
ON      table2.b = table1.a

reads:

For each row from table1, select all rows from table2 where the value of field b is equal to that of field a

Note that this condition can be rewritten as this:

SELECT  *
FROM    table1, table2
WHERE   table2.b = table1.a

, in which case it reads as following:

Make a set of all possible combinations of rows from table1 and table2 and of this set select only the rows where the value of field b is equal to that of field a

These conditions are worded differently, but they yield the same result and database systems are aware of that. Usually both these queries are optimized to use the same execution plan.

The former syntax is called ANSI syntax, and it is generally considered more readable and is recommended to use.

However, it didn't make it into Oracle until recently, that's why there are many hardcore Oracle developers that are just used to the latter syntax.

Actually, it's a matter of taste.

To use JOINs (with whatever syntax), both sets you are joining must be self-sufficient, i. e. the sets should not depend on each other. You can query both sets without ever knowing the contents on another set.

But for some tasks the sets are not self-sufficient. For instance, let's consider the following query:

We have table1 and table2. table1 has a column called rowcount.

For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id

We cannot come up with a join condition here. The join condition, should it exist, would involve the row number, which is not present in table2, and there is no way to calculate a row number only from the values of columns of any given row in table2.

That's where the CROSS APPLY can be used.

CROSS APPLY is a Microsoft's extension to SQL, which was originally intended to be used with table-valued functions (TVF's).

The query above would look like this:

SELECT  *
FROM    table1
CROSS APPLY
(
SELECT  TOP (table1.rowcount) *
FROM    table2
ORDER BY
id
) t2

For each from table1, select first table1.rowcount rows from table2 ordered by id

The sets here are not self-sufficient: the query uses values from table1 to define the second set, not to JOIN with it.

The exact contents of t2 are not known until the corresponding row from table1 is selected.

I previously said that there is no way to join these two sets, which is true as long as we consider the sets as is. However, we can change the second set a little so that we get an additional computed field we can later join on.

The first option to do that is just count all preceding rows in a subquery:

SELECT  *
FROM    table1 t1
JOIN    (
        SELECT  t2o.*,
                (
                SELECT  COUNT(*)
                FROM    table2 t2i
                WHERE   t2i.id <= t2o.id
                ) AS rn
        FROM    table2 t2o
        ) t2
ON      t2.rn <= t1.rowcount

The second option is to use a window function, also available in SQL Server since version 2005:

SELECT  *
FROM    table1 t1
JOIN    (
        SELECT  t2o.*, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM    table2 t2o
        ) t2
ON      t2.rn <= t1.rowcount

This function returns the ordinal number a row would have, be the ORDER BY condition used in the function applied to the whole query.

This is essentially the same result as the subquery used in the previous query.

Now, let's create the sample tables and check all these solutions for efficiency:

Written by Quassnoi

July 16th, 2009 at 11:00 pm

Posted in SQL Server

Oracle: OR on multiple EXISTS clauses