Longest common prefix: PostgreSQL
This is a series of articles on how to strip all strings in a set of their longest common prefix and concatenate the results:
- Longest common prefix: SQL Server
- Longest common prefix: PostgreSQL
- Longest common prefix: Oracle
- Longest common prefix: MySQL
Today, I'll show how to do it in PostgreSQL.
A quick reminder of the problem (taken from Stack Overflow):
I have some data:

| id | ref |
|----|------------|
| 1 | 3536757616 |
| 1 | 3536757617 |
| 1 | 3536757618 |
| 2 | 3536757628 |
| 2 | 3536757629 |
| 2 | 3536757630 |

and want to get the result like this:

| id | result |
|----|------------------|
| 1 | 3536757616/7/8 |
| 2 | 3536757629/28/30 |

Essentially, the data should be aggregated on `id`, and the `ref`'s should be concatenated together and separated by a `/` (slash), but with the longest common prefix removed.
Like with SQL Server, it is possible to do this in PostgreSQL using a single SQL query.
But since PostgreSQL offers the nice ability to create custom aggregates, I'd rather demonstrate how to solve this task using them.
In my opinion, custom aggregates fit here just perfectly.
Since PostgreSQL lacks native support for aggregate concatenation, we will need to create two custom aggregates here:
- `LCP(TEXT)` will find the longest common prefix (LCP) for a set of strings
- `GROUP_CONCAT(TEXT, TEXT)` will concatenate strings using a given delimiter
To implement `LCP`, we will exploit PostgreSQL's ability to generate a series of integers between given lower and upper bounds. Given two strings, we will increment the prefix length by one and compare the prefixes in a loop. We need to return the prefix one character shorter than the first pair of prefixes that didn't match. If all prefixes match, we just return the shorter of the two strings.
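To make the mechanics concrete before packaging them into a function, here is the comparison loop spelled out for two literal strings (a minimal standalone sketch of the same logic the state function below will use):

```sql
-- The prefixes of '3536757616' and '3536757617' first differ at s = 10,
-- so the first branch yields the 9-character prefix '353675761'.
-- If no position differed, the UNION ALL fallback would return the
-- lexicographically least (here, the shorter) string instead.
SELECT  SUBSTRING('3536757617', 1, s - 1)
FROM    generate_series(1, LEAST(LENGTH('3536757616'), LENGTH('3536757617'))) s
WHERE   SUBSTRING('3536757616', 1, s) <> SUBSTRING('3536757617', 1, s)
UNION ALL
SELECT  LEAST('3536757616', '3536757617')
LIMIT 1;
```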
To implement a custom aggregate in PostgreSQL, we normally need two functions: the first one (`SFUNC`) iterates over the values and accumulates the result in a state variable, while the second one (`FINALFUNC`) processes the state variable and returns the aggregated value. If no `FINALFUNC` is defined for an aggregate, it just returns the last value of the state variable as is. Since the result we need is exactly the last value of the state variable, we don't need a `FINALFUNC` here.

Here's the `SFUNC` and the aggregate:
```sql
CREATE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
        SELECT  SUBSTRING($2, 1, s - 1)
        FROM    generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
        WHERE   SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
        UNION ALL
        SELECT  LEAST($1, $2)
        LIMIT 1;
$$
LANGUAGE SQL;

CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
```
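Before wiring this into the bigger query, we can sanity-check it on its own. This is just an illustrative snippet, with a `VALUES` list standing in for real data:

```sql
-- One state transition: the prefixes diverge at position 10,
-- so the common prefix keeps the first 9 characters.
SELECT lcp_iterate('3536757616', '3536757617');  -- '353675761'

-- The aggregate applied to a whole group of values.
SELECT lcp(ref)
FROM   (VALUES ('3536757616'), ('3536757617'), ('3536757618')) v (ref);
-- '353675761'
```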
Now, we need to implement `GROUP_CONCAT`. This article describes how to do this in PostgreSQL; however, the function described there is overkill for our task, since it accepts an extra parameter, `IS_DISTINCT`, which we don't need here. Instead, we'll implement a simpler function, which is nevertheless useful per se. It also accepts a custom delimiter.
We will append the strings to an array in the iteration function (`SFUNC`) and concatenate them with `array_to_string` in the finalizing function (`FINALFUNC`). Unfortunately, the `FINALFUNC` has access only to the state variable, which means we will have to store the delimiter somewhere in the state. We will use the first element of the array for this, and slice it off before concatenating at the end.
Here are the functions and the aggregate:
```sql
CREATE FUNCTION group_concat_iterate(_state TEXT[], value TEXT, delimiter TEXT)
RETURNS TEXT[]
AS
$$
        SELECT COALESCE($1, ARRAY[$3]) || $2;
$$
LANGUAGE SQL VOLATILE;

CREATE FUNCTION group_concat_finish(_state TEXT[])
RETURNS TEXT
AS
$$
        SELECT array_to_string($1[2:array_upper($1, 1)], $1[1])
$$
LANGUAGE SQL VOLATILE;

CREATE AGGREGATE group_concat(TEXT, TEXT) (SFUNC = group_concat_iterate, STYPE = TEXT[], FINALFUNC = group_concat_finish);
```
Both functions are nice one-liners.
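Again, a quick illustrative check, with a made-up `VALUES` list as input:

```sql
-- The state array after the last row is {'/', 'a', 'b', 'c'}:
-- the delimiter is stashed in the first element, and the final
-- function slices it off before joining.
SELECT group_concat(v, '/')
FROM   (VALUES ('a'), ('b'), ('c')) t (v);
-- 'a/b/c'
```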
Finally, we just need to use these aggregates together. Since PostgreSQL doesn't yet support the `WITH` clause (as of 8.3), I'll create a table with sample data:
```sql
CREATE TABLE t_data (id INT NOT NULL, ref VARCHAR(20) NOT NULL);

INSERT INTO t_data
VALUES
        (1, '3536757616'),
        (1, '3536757617'),
        (1, '3536757618'),
        (2, '3536757628'),
        (2, '3536757629'),
        (2, '3536757630');
```
and use it in the following query:
```sql
SELECT  id,
        LCP(ref) ||
        (
        SELECT  group_concat(SUBSTRING(ref, LENGTH(LCP(d.ref)) + 1, LENGTH(ref)), '/')
        FROM    t_data di
        WHERE   di.id = d.id
        )
FROM    t_data d
GROUP BY id
ORDER BY id
```
| id | ?column? |
|----|------------------|
| 1 | 3536757616/7/8 |
| 2 | 3536757628/29/30 |

2 rows fetched in 0.0002s (0.0045s)
```
Sort  (cost=4243.04..4243.54 rows=200 width=62)
  Sort Key: d.id
  ->  HashAggregate  (cost=25.40..4235.40 rows=200 width=62)
        ->  Seq Scan on t_data d  (cost=0.00..18.80 rows=880 width=62)
        SubPlan
          ->  Aggregate  (cost=21.01..21.03 rows=1 width=58)
                ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                      Filter: (id = $1)
```
To me, it's beautiful.
And for those who are too lazy to create aggregates, or cannot do it for some reason, here's the pure SQL version of the query:
```sql
SELECT  id,
        lcp || array_to_string
        (
        ARRAY
        (
        SELECT  SUBSTRING(ref, LENGTH(lcp) + 1, LENGTH(ref))
        FROM    t_data di
        WHERE   di.id = dl.id
        ), '/'
        ) AS result
FROM    (
        SELECT  id,
                MIN
                (
                (
                SELECT  SUBSTRING(initref, 1, s - 1)
                FROM    generate_series(1, LEAST(LENGTH(initref), LENGTH(ref))) s
                WHERE   SUBSTRING(initref, 1, s) <> SUBSTRING(ref, 1, s)
                UNION ALL
                SELECT  LEAST(initref, ref)
                LIMIT 1
                )
                ) AS lcp
        FROM    (
                SELECT  id, ref,
                        (
                        SELECT  ref
                        FROM    t_data di
                        WHERE   di.id = d.id
                        LIMIT 1
                        ) AS initref
                FROM    t_data d
                ) dd
        GROUP BY id
        ) dl
ORDER BY id
```
| id | result |
|----|------------------|
| 1 | 3536757616/7/8 |
| 2 | 3536757628/29/30 |

2 rows fetched in 0.0002s (0.0044s)
```
Sort  (cost=7403.42..7403.92 rows=200 width=36)
  Sort Key: dl.id
  ->  Subquery Scan dl  (cost=23.20..7395.78 rows=200 width=36)
        ->  HashAggregate  (cost=23.20..3184.78 rows=200 width=62)
              ->  Seq Scan on t_data d  (cost=0.00..18.80 rows=880 width=62)
              SubPlan
                ->  Limit  (cost=15.75..15.80 rows=1 width=4)
                      ->  Append  (cost=15.75..60.95 rows=996 width=4)
                            ->  Function Scan on generate_series s  (cost=15.75..45.73 rows=995 width=4)
                                  Filter: ("substring"(($6)::text, 1, s) <> "substring"(($4)::text, 1, s))
                                  InitPlan
                                    ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                          ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                Filter: (id = $2)
                                    ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                          ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                Filter: (id = $2)
                                    ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                          ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                Filter: (id = $2)
                            ->  Subquery Scan "*SELECT* 2"  (cost=5.25..5.27 rows=1 width=0)
                                  ->  Result  (cost=5.25..5.26 rows=1 width=0)
                                        InitPlan
                                          ->  Limit  (cost=0.00..5.25 rows=1 width=58)
                                                ->  Seq Scan on t_data di  (cost=0.00..21.00 rows=4 width=58)
                                                      Filter: (id = $2)
        SubPlan
          ->  Seq Scan on t_data di  (cost=0.00..21.04 rows=4 width=58)
                Filter: (id = $1)
```
This is bulky and less elegant than using aggregates, but nevertheless usable and performant.
To be continued.