EXPLAIN EXTENDED

How to create fast database queries

Longest common prefix: PostgreSQL

Comments enabled. I *really* need your comment

This is a series of articles on how to strip all strings in a set of their longest common prefix and concatenate the results.

Today, I'll show how to do it in PostgreSQL.

A quick reminder of the problem (taken from Stack Overflow):

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the refs should be concatenated together, separated by a / (slash), but with the longest common prefix removed.

Like with SQL Server, it is possible to do this in PostgreSQL using a single SQL query.

But since PostgreSQL offers the nice ability to create custom aggregates, I'd rather demonstrate how to solve this task using them.

In my opinion, custom aggregates fit here just perfectly.

Since PostgreSQL lacks native support for aggregate concatenation, we will need to create two custom aggregates here:

  1. LCP(TEXT) will find the longest common prefix (LCP) for a set of strings
  2. GROUP_CONCAT(TEXT, TEXT) will concatenate strings using a given delimiter

To implement LCP, we will exploit PostgreSQL's ability to generate a series of integers, given lower and upper bounds.

Given two strings, we will increment the prefix length by one and compare the prefixes in a loop.

We need to return the prefix one character shorter than the first prefix that doesn't match.

If all prefixes match, we will just return the shorter of the two strings.
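
To make the comparison loop concrete, here is a standalone sketch (two hard-coded sample refs, not part of the final solution) that does exactly this:

SELECT  SUBSTRING('3536757616', 1, s - 1)
FROM    generate_series(1, LEAST(LENGTH('3536757616'), LENGTH('3536757617'))) s
WHERE   SUBSTRING('3536757616', 1, s) <> SUBSTRING('3536757617', 1, s)
UNION ALL
SELECT  LEAST('3536757616', '3536757617')
LIMIT 1

The strings first differ at position 10, so the query returns 353675761. The SFUNC below is essentially this query with the literals replaced by its arguments.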

To implement a custom aggregate in PostgreSQL, we normally need two functions: the first one (SFUNC) iterates over the values and accumulates the result in a state variable, while the second one (FINALFUNC) processes the state variable and returns the aggregated value.

If FINALFUNC is not defined for an aggregate, the aggregate just returns the final value of the state variable as is.

Since the result we need is the last value of the state variable, we don't need a FINALFUNC here.

Here's the SFUNC and the aggregate:

CREATE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
        -- Find the first position at which the prefixes diverge
        -- and return the prefix one character shorter than that
        SELECT  SUBSTRING($2, 1, s - 1)
        FROM    generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
        WHERE   SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
        UNION ALL
        -- If all prefixes match, one string is a prefix of the other,
        -- and LEAST returns the shorter one
        SELECT  LEAST($1, $2)
        LIMIT 1;
$$
LANGUAGE 'sql';

CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
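
To try the aggregate out, we can feed it an ad hoc VALUES list (just a quick sanity check of mine, not part of the task):

SELECT  lcp(ref)
FROM    (
        VALUES
        ('3536757616'),
        ('3536757617'),
        ('3536757618')
        ) v (ref)

This returns 353675761, the longest prefix shared by the three strings.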

Now, we need to implement GROUP_CONCAT.

This article describes how to do this in PostgreSQL; however, the function described there is overkill for our task, since it accepts an extra parameter, IS_DISTINCT, which we don't need here.

Instead, we'll implement a simpler function, which is nevertheless useful in its own right. It also accepts a custom delimiter.

We will append the strings to an array in the iteration function (SFUNC) and concatenate them with array_to_string in the finalizing function (FINALFUNC).

Unfortunately, FINALFUNC has access only to the state variable, which means we will have to store the delimiter somewhere in the state.

We will use the first element of the array to store the delimiter, and slice it off before concatenating in the end.
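
To illustrate the trick with a concrete state (my own example): after aggregating the values 16, 17 and 18 with the delimiter /, the state array will be {/,16,17,18}, and the finalizing step joins elements 2 through 4 using element 1 as the delimiter:

SELECT  array_to_string(a[2:array_upper(a, 1)], a[1]) AS result
FROM    (
        SELECT  ARRAY['/', '16', '17', '18'] AS a
        ) q

This yields 16/17/18.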

Here are the functions and the aggregate:

CREATE FUNCTION group_concat_iterate(_state TEXT[], value TEXT, delimiter TEXT)
RETURNS TEXT[]
AS
$$
        -- On the first call the state is NULL: seed it with the delimiter,
        -- then append the current value
        SELECT  COALESCE($1, ARRAY[$3]) || $2;
$$
LANGUAGE 'sql' VOLATILE;

CREATE FUNCTION group_concat_finish(_state TEXT[])
RETURNS TEXT
AS
$$
        -- Join elements 2..n using the delimiter stored in element 1
        SELECT  array_to_string($1[2:array_upper($1, 1)], $1[1])
$$
LANGUAGE 'sql' VOLATILE;

CREATE AGGREGATE group_concat(text, text) (SFUNC = group_concat_iterate, STYPE = TEXT[], FINALFUNC = group_concat_finish);

Both functions are nice one-liners.
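
Here's a quick standalone check of the new aggregate on an ad hoc rowset (again, my own sanity check rather than the task data):

SELECT  id, group_concat(ref, '/') AS result
FROM    (
        VALUES
        (1, '16'),
        (1, '17'),
        (2, '28')
        ) v (id, ref)
GROUP BY
        id

This returns 16/17 for id 1 and 28 for id 2.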

Finally, we just need to use these aggregates together.

Since PostgreSQL doesn't yet support WITH clause (as of 8.3), I'll create a table with sample data:

CREATE TABLE t_data (id INT NOT NULL, ref VARCHAR(20) NOT NULL);

INSERT
INTO    t_data
VALUES
        (1, '3536757616'),
        (1, '3536757617'),
        (1, '3536757618'),
        (2, '3536757628'),
        (2, '3536757629'),
        (2, '3536757630');

and use it in the following query:

SELECT  id, LCP(ref) ||
        (
        SELECT  group_concat(SUBSTRING(ref, LENGTH(LCP(d.ref)) + 1, LENGTH(ref)), '/')
        FROM    t_data di
        WHERE   di.id = d.id
        )
FROM    t_data d
GROUP BY
        id
ORDER BY
        id

id ?column?
1 3536757616/7/8
2 3536757628/29/30
2 rows fetched in 0.0002s (0.0045s)
Sort  (cost=4243.04..4243.54 rows=200 width=62)
  Sort Key: d.id
  ->  HashAggregate  (cost=25.40..4235.40 rows=200 width=62)
        ->  Seq Scan on t_data d  (cost=0.00..18.80 rows=880 width=62)
        SubPlan
          ->  Aggregate  (cost=21.01..21.03 rows=1 width=58)
                ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                      Filter: (id = $1)

To me, it's beautiful.

And for those who are too lazy to create aggregates or cannot do it for some reason, here's the pure SQL version of the query:

SELECT  id,
        lcp || array_to_string
        (
        ARRAY
        (
        SELECT  SUBSTRING(ref, LENGTH(lcp) + 1, LENGTH(ref))
        FROM    t_data di
        WHERE   di.id = dl.id
        ), '/'
        ) AS result
FROM    (
        SELECT  id,
                MIN
                (
                (
                SELECT  SUBSTRING(initref, 1, s - 1)
                FROM    generate_series(1, LEAST(LENGTH(initref), LENGTH(ref))) s
                WHERE   SUBSTRING(initref, 1, s) <> SUBSTRING(ref, 1, s)
                UNION ALL
                SELECT  LEAST(initref, ref)
                LIMIT 1
                )
                ) AS lcp
        FROM    (
                SELECT  id, ref,
                        (
                        SELECT  ref
                        FROM    t_data di
                        WHERE   di.id = d.id
                        LIMIT 1
                        ) AS initref
                FROM    t_data d
                ) dd
        GROUP BY 
                id
        ) dl
ORDER BY
        id

id result
1 3536757616/7/8
2 3536757628/29/30
2 rows fetched in 0.0002s (0.0044s)
Sort  (cost=7403.42..7403.92 rows=200 width=36)
  Sort Key: dl.id
  ->  Subquery Scan dl  (cost=23.20..7395.78 rows=200 width=36)
        ->  HashAggregate  (cost=23.20..3184.78 rows=200 width=62)
              ->  Seq Scan on t_data d  (cost=0.00..18.80 rows=880 width=62)
              SubPlan
                ->  Limit  (cost=15.75..15.80 rows=1 width=4)
                      ->  Append  (cost=15.75..60.95 rows=996 width=4)
                            ->  Function Scan on generate_series s  (cost=15.75..45.73 rows=995 width=4)
                                  Filter: ("substring"(($6)::text, 1, s) <> "substring"(($4)::text, 1, s))
                                  InitPlan
                                    ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                          ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                Filter: (id = $2)
                                    ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                          ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                Filter: (id = $2)
                                    ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                          ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                Filter: (id = $2)
                            ->  Subquery Scan "*SELECT* 2"  (cost=5.25..5.27 rows=1 width=0)
                                  ->  Result  (cost=5.25..5.26 rows=1 width=0)
                                        InitPlan
                                          ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                                ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                      Filter: (id = $2)
        SubPlan
          ->  Seq Scan on t_data di  (cost=0.00..21.04 rows=4 width=58)
                Filter: (id = $1)

This is bulky and less elegant than using aggregates, but nevertheless usable and performant.

To be continued.

Written by Quassnoi

June 5th, 2009 at 11:00 pm

Posted in PostgreSQL
