EXPLAIN EXTENDED

How to create fast database queries

Selecting N records for each group: PostgreSQL

Comments enabled. I *really* need your comment

In one of the previous articles:

Advanced row sampling

, I described how to select a certain number of records for each group in a MySQL table.

This is trivial in SQL Server and Oracle, since an analytic function ROW_NUMBER() can be used to do this.

Now, I'll describe how to do it in PostgreSQL.

We are assuming that the records from the table should be grouped by a field called grouper, and for each grouper, N first records should be selected, ordered first by ts, then by id.

Let's create a sample table:

CREATE TABLE t_limiter (
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
ts TIMESTAMP NOT NULL,
value VARCHAR(50) NOT NULL
);

CREATE INDEX ix_limiter_grouper_ts_id ON t_limiter (grouper, ts, id);

BEGIN TRANSACTION;

SELECT  SETSEED(0.20090429);

INSERT
INTO    t_limiter (id, grouper, ts, value)
SELECT  s, s % 10,
NOW()::DATE - (RANDOM() * 1000000 || ' SECOND')::INTERVAL,
'Value ' || s
FROM    generate_series(1, 1000000) s;

COMMIT;

ANALYZE t_limiter;

A quick reminder of an algorithm to select a number of rows from a table.

  1. Select DISTINCT grouper's from the table
  2. For each grouper, select n'th row from the table, ordered by (ts, id). This will be used as a condition in a subsequent JOIN
  3. JOIN the table with the grouper's rowset selected on step 1, using the values selected on step 2 as an upper bound for the JOIN condition

PostgreSQL provides a convenient way to select whole row from a table as a single column.

This allows us to overcome the 1 column, 1 row limitation of a subquery in SELECT list, that required us to use two subqueries for the same solution in MySQL:

SELECT  l.*
FROM    (
SELECT  grouper,
(
SELECT  li
FROM    t_limiter li
WHERE   li.grouper = dl.grouper
ORDER BY
grouper, ts, id
OFFSET  2 LIMIT 1
) AS mid
FROM    (
SELECT  grouper
FROM    t_limiter lo
GROUP BY
grouper
) dl
) dlo
JOIN    t_limiter l
ON      l.grouper = dlo.grouper
AND ROW(l.ts, l.id) < ROW(COALESCE((mid).ts, 'Infinity'::TIMESTAMP), (mid).id)

id grouper ts value
162376 6 2009-04-18 10:13:47.607196 Value 162376
478746 6 2009-04-18 10:13:49.983931 Value 478746
442419 9 2009-04-18 10:13:39.550324 Value 442419
503499 9 2009-04-18 10:13:42.567343 Value 503499
801690 0 2009-04-18 10:13:22.968126 Value 801690
773960 0 2009-04-18 10:13:45.413 Value 773960
997563 3 2009-04-18 10:13:20.169501 Value 997563
367513 3 2009-04-18 10:13:24.497357 Value 367513
191665 5 2009-04-18 10:13:23.487338 Value 191665
672365 5 2009-04-18 10:14:11.426236 Value 672365
355777 7 2009-04-18 10:13:49.288233 Value 355777
374467 7 2009-04-18 10:14:02.877161 Value 374467
151891 1 2009-04-18 10:13:41.42042 Value 151891
564221 1 2009-04-18 10:14:01.598455 Value 564221
159122 2 2009-04-18 10:13:27.494353 Value 159122
419912 2 2009-04-18 10:13:41.553133 Value 419912
477894 4 2009-04-18 10:13:35.536324 Value 477894
372574 4 2009-04-18 10:13:36.006642 Value 372574
925608 8 2009-04-18 10:13:32.785197 Value 925608
70658 8 2009-04-18 10:13:39.880478 Value 70658
20 rows fetched in 0.0009s (0.4925s)
Nested Loop  (cost=19854.84..741196.47 rows=333329 width=28)
  ->  HashAggregate  (cost=19852.83..19852.92 rows=10 width=4)
        ->  Seq Scan on t_limiter lo  (cost=0.00..17352.86 rows=999986 width=4)
  ->  Index Scan using ix_limiter_grouper_ts_id on t_limiter l  (cost=2.02..4337.93 rows=33333 width=28)
        Index Cond: ((l.grouper = lo.grouper) AND (ROW(l.ts, l.id) < ROW(COALESCE(((subplan)).ts, 'infinity'::timestamp without time zone), ((subplan)).id)))
        SubPlan
          ->  Limit  (cost=0.67..1.01 rows=1 width=48)
                ->  Index Scan using ix_limiter_grouper_ts_id on t_limiter li  (cost=0.00..33606.54 rows=99999 width=48)
                      Index Cond: (grouper = $0)
          ->  Limit  (cost=0.67..1.01 rows=1 width=48)
                ->  Index Scan using ix_limiter_grouper_ts_id on t_limiter li  (cost=0.00..33606.54 rows=99999 width=48)
                      Index Cond: (grouper = $0)

Note the major differences between MySQL and PostgreSQL solutions:

  • We use GROUP BY in PostgreSQL instead of DISTINCT in MySQL.This is because PostgreSQL implementation of DISTINCT uses sorting and does not employ index scan.

    GROUP BY in PostgreSQL, on the other hand, uses HashAggregate, which is also inefficient, but not as inefficient as DISTINCT

  • We use one subselect instead of two in MySQL. This is because PostgreSQL allows selecting a whole record from the table as a single column
  • We use simple l.grouper = dlo.grouper condition instead of l.grouper <= dlo.grouper AND l.grouper >= dlo.grouper. Unlike MySQL, PostgreSQL is smart enough to use all three columns in an index range condition withouts these hacks
  • As a result, this PostgreSQL query looks much nicer than MySQL's one, but performs more poorly, due to PostgreSQL's inability to use an INDEX GROUP BY employed by MySQL to select distinct values.

    Written by Quassnoi

    April 29th, 2009 at 11:00 pm

    Posted in PostgreSQL

    Leave a Reply