EXPLAIN EXTENDED

How to create fast database queries

GROUP_CONCAT in PostgreSQL without aggregate functions

with one comment

In one of the previous articles:

Aggregate concatenation

, I described an aggregate function to concatenate strings in PostgreSQL, similar to GROUP_CONCAT in MySQL.

It's very useful if you have a complex GROUP BY query with multiple conditions.

But for some simple queries it's possible to emulate GROUP_CONCAT with pure SQL, avoiding custom functions at all.

Let's create a table to demonstrate our task:

CREATE TABLE t_grouper (
        id INT NOT NULL PRIMARY KEY,
        grouper INT NOT NULL,
        value VARCHAR(50) NOT NULL
        );

CREATE INDEX ix_limiter_grouper ON t_grouper (grouper);

BEGIN TRANSACTION;

SELECT  SETSEED(0.20090502);

INSERT
INTO    t_grouper (id, grouper, value)
SELECT  s, (s - 1) % 10 + 1,
        CASE FLOOR(RANDOM() * 10)
                WHEN 0 THEN 'lorem'
                WHEN 1 THEN 'ipsum'
                WHEN 2 THEN 'dolor'
                WHEN 3 THEN 'sit'
                WHEN 4 THEN 'amet'
                WHEN 5 THEN 'consecteur'
                WHEN 6 THEN 'adipisicing'
                WHEN 7 THEN 'elit'
                WHEN 8 THEN 'sed'
                WHEN 9 THEN 'do'
        END
FROM    generate_series(1, 100) s;

COMMIT;

ANALYZE t_grouper;

See what do we have in this table:

SELECT  *
FROM    t_grouper

id grouper value
1 1 ipsum
2 2 adipisicing
3 3 do
4 4 do
5 5 sit
6 6 amet
7 7 dolor
8 8 dolor
9 9 sit
10 10 sit
11 1 lorem
12 2 dolor
98 8 ipsum
99 9 lorem
100 10 sit
100 rows fetched in 0.0033s (0.0022s)

There are 10 groups, with 10 random words in each. There is an index on grouper.

To emulate the GROUP_CONCAT, we just need to use PostgreSQL's array manupulation functions:

  1. Select DISTINCT grouper's
  2. For each grouper, select all value's in a subquery
  3. Convert the subquery's resultset into an array
  4. Use array_to_string to concatenate the value's

Here's the query to concatenate strings in a group:

SELECT  grouper,
        array_to_string
        (
        ARRAY   (
                SELECT  value
                FROM    t_grouper gi
                WHERE   gi.grouper = gd.grouper
                ),
        ', '
        ) AS group_concat
FROM    (
        SELECT  grouper
        FROM    t_grouper
        GROUP BY
                grouper
        ) gd
ORDER BY
        grouper
grouper group_concat
1 ipsum, lorem, sit, consecteur, adipisicing, amet, ipsum, sit, amet, consecteur
2 adipisicing, dolor, dolor, amet, sit, amet, ipsum, lorem, dolor, lorem
3 do, elit, amet, sit, adipisicing, adipisicing, do, dolor, consecteur, adipisicing
4 do, dolor, lorem, lorem, do, sit, adipisicing, sed, adipisicing, sit
5 sit, sit, elit, ipsum, do, consecteur, ipsum, lorem, adipisicing, sit
6 amet, sit, do, adipisicing, dolor, ipsum, elit, elit, sit, do
7 dolor, do, sit, consecteur, do, elit, elit, elit, ipsum, adipisicing
8 dolor, do, sit, sit, consecteur, elit, do, do, adipisicing, ipsum
9 sit, sit, amet, do, consecteur, do, ipsum, elit, consecteur, lorem
10 sit, sit, sit, adipisicing, sit, ipsum, do, consecteur, elit, sit
10 rows fetched in 0.0004s (0.0037s)
Sort  (cost=25.14..25.17 rows=10 width=4)
  Sort Key: gd.grouper
  ->  Subquery Scan gd  (cost=2.25..24.97 rows=10 width=4)
        ->  HashAggregate  (cost=2.25..2.35 rows=10 width=4)
              ->  Seq Scan on t_grouper  (cost=0.00..2.00 rows=100 width=4)
        SubPlan
          ->  Seq Scan on t_grouper gi  (cost=0.00..2.25 rows=10 width=6)
                Filter: (grouper = $0)

Note that string here come in no particular order (strictly speaking, they come in the table order, as the Seq Scan is chosen by the query optimizer).

To force order, we need to add an ORDER BY condition into the subquery:

SELECT  grouper,
        array_to_string
        (
        ARRAY   (
                SELECT  value
                FROM    t_grouper gi
                WHERE   gi.grouper = gd.grouper
                ORDER BY
                        value
                ),
        ', '
        ) AS group_concat
FROM    (
        SELECT  grouper
        FROM    t_grouper
        GROUP BY
                grouper
        ) gd
ORDER BY
        grouper
grouper group_concat
1 adipisicing, amet, amet, consecteur, consecteur, ipsum, ipsum, lorem, sit, sit
2 adipisicing, amet, amet, dolor, dolor, dolor, ipsum, lorem, lorem, sit
3 adipisicing, adipisicing, adipisicing, amet, consecteur, do, do, dolor, elit, sit
4 adipisicing, adipisicing, do, do, dolor, lorem, lorem, sed, sit, sit
5 adipisicing, consecteur, do, elit, ipsum, ipsum, lorem, sit, sit, sit
6 adipisicing, amet, do, do, dolor, elit, elit, ipsum, sit, sit
7 adipisicing, consecteur, do, do, dolor, elit, elit, elit, ipsum, sit
8 adipisicing, consecteur, do, do, do, dolor, elit, ipsum, sit, sit
9 amet, consecteur, consecteur, do, do, elit, ipsum, lorem, sit, sit
10 adipisicing, consecteur, do, elit, ipsum, sit, sit, sit, sit, sit
10 rows fetched in 0.0004s (0.0044s)
Sort  (cost=27.05..27.08 rows=10 width=4)
  Sort Key: gd.grouper
  ->  Subquery Scan gd  (cost=2.25..26.89 rows=10 width=4)
        ->  HashAggregate  (cost=2.25..2.35 rows=10 width=4)
              ->  Seq Scan on t_grouper  (cost=0.00..2.00 rows=100 width=4)
        SubPlan
          ->  Sort  (cost=2.42..2.44 rows=10 width=6)
                Sort Key: gi.value
                ->  Seq Scan on t_grouper gi  (cost=0.00..2.25 rows=10 width=6)
                      Filter: (grouper = $0)

If we want to avoid duplicates in the concatenated strings, we need to add DISTINCT or GROUP BY clause into the subquery:

SELECT  grouper,
        array_to_string
        (
        ARRAY   (
                SELECT  DISTINCT value
                FROM    t_grouper gi
                WHERE   gi.grouper = gd.grouper
                ORDER BY
                        value
                ),
        ', '
        ) AS group_concat
FROM    (
        SELECT  grouper
        FROM    t_grouper
        GROUP BY
                grouper
        ) gd
ORDER BY
        grouper
grouper group_concat
1 adipisicing, amet, consecteur, ipsum, lorem, sit
2 adipisicing, amet, dolor, ipsum, lorem, sit
3 adipisicing, amet, consecteur, do, dolor, elit, sit
4 adipisicing, do, dolor, lorem, sed, sit
5 adipisicing, consecteur, do, elit, ipsum, lorem, sit
6 adipisicing, amet, do, dolor, elit, ipsum, sit
7 adipisicing, consecteur, do, dolor, elit, ipsum, sit
8 adipisicing, consecteur, do, dolor, elit, ipsum, sit
9 amet, consecteur, do, elit, ipsum, lorem, sit
10 adipisicing, consecteur, do, elit, ipsum, sit
10 rows fetched in 0.0004s (0.0049s)
Sort  (cost=27.30..27.33 rows=10 width=4)
  Sort Key: gd.grouper
  ->  Subquery Scan gd  (cost=2.25..27.14 rows=10 width=4)
        ->  HashAggregate  (cost=2.25..2.35 rows=10 width=4)
              ->  Seq Scan on t_grouper  (cost=0.00..2.00 rows=100 width=4)
        SubPlan
          ->  Unique  (cost=2.42..2.47 rows=1 width=6)
                ->  Sort  (cost=2.42..2.44 rows=10 width=6)
                      Sort Key: gi.value
                      ->  Seq Scan on t_grouper gi  (cost=0.00..2.25 rows=10 width=6)
                            Filter: (grouper = $0)

Note that is there are lots of DISTINCT values in a subquery, DISTINCT is more efficient for queries with ORDER BY, while GROUP BY is more efficient for unordered queries.

That's because the former one uses Sort to eliminate non-UNIQUE values (and therefore ORDER BY is free), while the latter one uses HashAggregate, which is more efficient but the values come unsorted.

If there are few DISTINCT values, then the sorting is easy on the final, DISTINCT rowset, and HashAggregate efficiency overweights the advantages of presorted rows.

In this case, you should use GROUP BY even with ORDER BY.

One final note: this query is useful only for simple cases, where you can construct a simple subquery for all groups.

If the query is complex (like, it uses WHERE conditions or JOIN's), you'll need to propagate the conditions into the subquery too, and this can kill all performance and readability benefits.

In this case, it's better to create an aggregate function (as described in Aggregate concatenation) and use it with GROUP BY

Written by Quassnoi

May 2nd, 2009 at 11:00 pm

Posted in PostgreSQL

One Response to 'GROUP_CONCAT in PostgreSQL without aggregate functions'

Subscribe to comments with RSS

  1. Great !!
    Thank you very much, it’s very useful and very well explained !!

    Valerie Roux

    10 Dec 12 at 16:04

Leave a Reply