EXPLAIN EXTENDED

How to create fast database queries

Aggregate concatenation

Comments enabled. I *really* need your comment

Aggregate concatenation functions help creating a concatenated list out of a recordset. Useful for reports, hierarchical trees, etc.

MySQL supplies GROUP_CONCAT for this purpose. SYS_CONNECT_BY PATH and FOR XML can be used in Oracle and MS SQL.

In PostgreSQL, we cannot use these tricks, but we can create our own aggregate function. And this function will also accept two more extremely useful parameters: DELIMITER and IS_DISTINCT.

DELIMITER is self-explanatory: it will be used to separate the values. IS_DISTINCT, on the other hard, is more tricky: it will be evaluated on each step, and will be used to decide if this very value should be filtered out.

Let's create the functions and see how they work:

CREATE TYPE "20090304_groupconcat".tp_concat AS (data TEXT[], delimiter TEXT);

CREATE OR REPLACE FUNCTION "20090304_groupconcat".group_concat_iterate(_state "20090304_groupconcat".tp_concat, _value TEXT, delimiter TEXT, is_distinct boolean)
RETURNS "20090304_groupconcat".tp_concat AS
$BODY$
SELECT
CASE
WHEN $1 IS NULL THEN ARRAY[$2]
WHEN $4 AND $1.data @> ARRAY[$2] THEN $1.data
ELSE $1.data || $2
END,
$3
$BODY$
LANGUAGE 'sql' VOLATILE;

CREATE OR REPLACE FUNCTION "20090304_groupconcat".group_concat_finish(_state "20090304_groupconcat".tp_concat)
RETURNS text AS
$BODY$
SELECT array_to_string($1.data, $1.delimiter)
$BODY$
LANGUAGE 'sql' VOLATILE;

CREATE AGGREGATE "20090304_groupconcat".group_concat(text, text, boolean) (SFUNC = "20090304_groupconcat".group_concat_iterate, STYPE = "20090304_groupconcat".tp_concat, FINALFUNC = "20090304_groupconcat".group_concat_finish);

CREATE TABLE "20090304_groupconcat".t_exception (exp_id INT NOT NULL PRIMARY KEY, exp_word TEXT NOT NULL);

As you can see, we utilize a custom datatype here. We need to store DELIMITER along with intermediate aggregated data, as it's the only data FINALFUNC has access to. We also use array operations to store data and search for duplicates.

IS_DISTINCT operation is best illustrated with this query:

SELECT   "20090303_groupconcat".group_concat(val, ', ', val = 'grouped')
FROM   (
SELECT  'grouped' AS val
UNION ALL
SELECT  'grouped'
UNION ALL
SELECT  'ungrouped'
UNION ALL
SELECT  'ungrouped'
) g;
group_concat
grouped, ungrouped, ungrouped

The aggregate groups all records that have value grouped and leaves all other records as is.

When using this aggregate against the indexed column on large datasets, it's better to use DISTINCT clause in an uncorrelated subquery. This aggregate will not utilize any indexes on the aggregated column, while DISTINCT will:

Table creation scripts

SELECT    "20090304_groupconcat".group_concat(exp_word, ', ', TRUE)
FROM   (
SELECT  DISTINCT exp_word
FROM    t_exception
) e;
group_concat
asper, gener, liber, miser, puer, socer, tener, vesper
Time: 1213.252 ms

When there is no index or it's not used, the aggregate is of same performance as DISTINCT clause:

SET ENABLE_INDEXSCAN=OFF;
SELECT   "20090304_groupconcat".group_concat(exp_word, ', ', TRUE)
FROM   (
SELECT  DISTINCT exp_word
FROM    "20090304_groupconcat".t_exception
) e;
SET ENABLE_INDEXSCAN=ON;
group_concat
asper, gener, liber, miser, puer, socer, tener, vesper
Time: 14391.568 ms
SET ENABLE_INDEXSCAN=OFF;
SELECT     "20090304_groupconcat".group_concat(exp_word, ', ', TRUE)
FROM   "20090304_groupconcat".t_exception;
SET ENABLE_INDEXSCAN=ON;
group_concat
puer, socer, vesper, gener, asper, liber, miser, tener
Time: 14324.554 ms

One final note: the aggregate outputs values in order it receives the first instance of each value.

It you need any specific order, use ORDER BY in an uncorrelated subquery.

Written by Quassnoi

March 4th, 2009 at 9:00 pm

Posted in PostgreSQL

Leave a Reply