EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL: row numbers

Comments enabled. I *really* need your comment

Note: this article concerns PostgreSQL 8.3 and below.

PostgreSQL 8.4 introduces window functions.

Window function ROW_NUMBER() implements the functionality in question more efficiently.

ROWNUM is a very useful pseudocolumn in Oracle that returns the position of each row in a final dataset.

Upcoming PostgreSQL 8.4 will have this pseudocolumn, but as for now will we need a hack to access it.

The main idea is simple:

  1. Wrap the query results into an array
  2. Join this array with a generate_series() so that numbers from 1 to array_upper() are returned
  3. For each row returned, return this number (as ROWNUM) along the corresponding array member (which is the row from the original query)

Let's create a table with multiple columns of different datatypes, write a complex query and try to assign the ROWNUM to the query results:

CREATE TABLE t_complex (
        int_value INT NOT NULL,
        double_value FLOAT NOT NULL,
        varchar_value VARCHAR(50) NOT NULL,
        timestamp_value TIMESTAMP NOT NULL
        );

BEGIN TRANSACTION;

SELECT  SETSEED(0.20090505);

INSERT
INTO    t_complex
SELECT  RANDOM() * 10 + 1,
        ((RANDOM() * 10 + 1)::INTEGER)::FLOAT / 10,
        'Value ' || (RANDOM() * 10 + 1)::INTEGER,
        '2009-05-05'::TIMESTAMP - (((RANDOM() * 10 + 1)::INTEGER) || ' day')::INTERVAL
FROM    generate_series(1, 100000) s;

COMMIT;

ANALYZE t_complex;

We created a table with an INT, a FLOAT, a VARCHAR and a TIMESTAMP, and filled it with 100,000 random values (including full duplicates).

Now, let's query this table:

SELECT  *
FROM    t_complex
WHERE   int_value IN (2, 3)
        AND double_value IN (0.4, 0.5)
ORDER BY
        int_value, double_value, varchar_value, timestamp_value
LIMIT 20

int_value double_value varchar_value timestamp_value
2 0.4 Value 1 2009-04-24 00:00:00
2 0.4 Value 1 2009-04-24 00:00:00
2 0.4 Value 1 2009-04-24 00:00:00
2 0.4 Value 1 2009-04-24 00:00:00
2 0.4 Value 1 2009-04-25 00:00:00
2 0.4 Value 1 2009-04-25 00:00:00
2 0.4 Value 1 2009-04-25 00:00:00
2 0.4 Value 1 2009-04-25 00:00:00
2 0.4 Value 1 2009-04-25 00:00:00
2 0.4 Value 1 2009-04-26 00:00:00
2 0.4 Value 1 2009-04-26 00:00:00
2 0.4 Value 1 2009-04-26 00:00:00
2 0.4 Value 1 2009-04-26 00:00:00
2 0.4 Value 1 2009-04-27 00:00:00
2 0.4 Value 1 2009-04-28 00:00:00
2 0.4 Value 1 2009-04-28 00:00:00
2 0.4 Value 1 2009-04-28 00:00:00
2 0.4 Value 1 2009-04-28 00:00:00
2 0.4 Value 1 2009-04-28 00:00:00
2 0.4 Value 1 2009-04-28 00:00:00
20 rows fetched in 0.0009s (0.0416s)

We have quite a complex query with WHERE, ORDER BY and LIMIT clauses.

Now, let's try to assign the ROWNUM to this query:

SELECT  ROWNUM, (a[ROWNUM]).*
FROM    (
        SELECT  a, generate_series(1, array_upper(a, 1)) AS ROWNUM
        FROM    (
                SELECT  ARRAY
                        (
                        SELECT  t_complex
                        FROM    t_complex
                        WHERE   int_value IN (2, 3)
                                AND double_value IN (0.4, 0.5)
                        ORDER BY
                                int_value, double_value, varchar_value, timestamp_value
                        LIMIT 20
                        ) AS a
                ) q2
        ) q3

rownum int_value double_value varchar_value timestamp_value
1 2 0.4 Value 1 2009-04-24 00:00:00
2 2 0.4 Value 1 2009-04-24 00:00:00
3 2 0.4 Value 1 2009-04-24 00:00:00
4 2 0.4 Value 1 2009-04-24 00:00:00
5 2 0.4 Value 1 2009-04-25 00:00:00
6 2 0.4 Value 1 2009-04-25 00:00:00
7 2 0.4 Value 1 2009-04-25 00:00:00
8 2 0.4 Value 1 2009-04-25 00:00:00
9 2 0.4 Value 1 2009-04-25 00:00:00
10 2 0.4 Value 1 2009-04-26 00:00:00
11 2 0.4 Value 1 2009-04-26 00:00:00
12 2 0.4 Value 1 2009-04-26 00:00:00
13 2 0.4 Value 1 2009-04-26 00:00:00
14 2 0.4 Value 1 2009-04-27 00:00:00
15 2 0.4 Value 1 2009-04-28 00:00:00
16 2 0.4 Value 1 2009-04-28 00:00:00
17 2 0.4 Value 1 2009-04-28 00:00:00
18 2 0.4 Value 1 2009-04-28 00:00:00
19 2 0.4 Value 1 2009-04-28 00:00:00
20 2 0.4 Value 1 2009-04-28 00:00:00
20 rows fetched in 0.0010s (0.0477s)

Works fine, despite the multiple datatypes in the table, query complexity and duplicates.

This solution, though, has some drawbacks:

  • Execution time is Θ(exp(N)). There must be a Schlemiel the painter's algortihm somewhere in the array parsing code.
  • The inner subquery should return a single column of a named type, this is:

    • A built-in database type
    • A custom type created with CREATE TYPE
    • A table rowtype

    If you look into the inner subquery of the ROWNUM query, you will see that it returns a single column of row type t_complex instead of 4 columns which original query returned. This column is wrapped into an ARRAY and is expanded in the outermost query.

    This expression:

    a.ROWNUM

    returns one column of row type, while this one:

    (a[ROWNUM]).*

    makes 4 columns of basic types out of that one column of row type.

    It is required for array creation. Unfortunately, PostgreSQL cannot create arrays of anonymous records.

    This, however, can be easily worked around by creating a named type or a view for the subquery

Apart from that, this solution (which is quite efficient for smaller tables) can be used to emulate ROW_NUMBER in PostgreSQL 8.3 and below.

Written by Quassnoi

May 5th, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply