Longest common prefix: SQL Server
Comments enabled. I *really* need your comment
From Stack Overflow:
I have some data:
id ref 1 3536757616 1 3536757617 1 3536757618 2 3536757628 2 3536757629 2 3536757630 and want to get the result like this:
id result 1 3536757616/7/8 2 3536757629/28/30 Essentially, the data should be aggregated on
id
, and theref
's should be concatenated together and separated by a/
(slash), but with longest common prefix removed.
I've already encoutered this problem several times, so I'll try to cover solutions for all RDBMS's my blog is about:
- Longest common prefix: SQL Server
- Longest common prefix: PostgreSQL
- Longest common prefix: Oracle
- Longest common prefix: MySQL
I hope this will be interesting as approaches will differ significantly for all four systems.
Today is SQL Server time.
I won't create sample tables here, since I'm demonstrating the principle. Instead, I'll just use dynamically generated data.
What do we need to do here is:
- Find least common prefix for each group and its length
- Cut off the prefix of each but the first
ref
, usingSUBSTRING
- Concatenate the strings using
FOR XML
Steps 2 and 3 are quite simple, but the first one needs some effort.
SQL Server doesn't provide a function to find the longest common prefix (LCP), so we'll have to implement it.
With some limitations, it may be done using pure SQL
, no UDF
's.
It's easily provable that the LCP for a set of strings is the shortest of the LCP's for any given string from the set and all other strings.
I. e., if we take a single string from the set, find its LCP's for all other strings from the set, and take the minimal one, this will be LCP of the whole set.
So we need to select a single ref
out of each group, compare it to all other ref
's from the same group, and find the shortest LCP
.
But how do we find an LCP
for a pair of strings?
We could easily do it in a loop in any procedural language, but SQL
is not procedural. However, we can implement it using pure set operations.
We'll perform the following steps:
- Generate a set of integers from 1 to 100 using a CTE
- Take the prefixes of both strings of corresponding length, using
LEFT
- Compare the prefixes and filter the row out if they don't match
- Return the maximal length of prefixes that matched
Here's the query to do that:
WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ) SELECT MAX(num) FROM series WHERE LEFT('abcDEF', num) = LEFT('abcGHI', num)
3 |
This query returns the greatest num
for which LEFT(num)
of both strings are equal.
Now, we need to take and arbitrary string from each group, pair it with all other strings from the same group, and find LCP's for each pair:
WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ), data AS ( SELECT 1 AS id, '3536757616' AS ref UNION ALL SELECT 1 AS id, '3536757617' AS ref UNION ALL SELECT 1 AS id, '3536757618' AS ref UNION ALL SELECT 2 AS id, '3536757628' AS ref UNION ALL SELECT 2 AS id, '3536757629' AS ref UNION ALL SELECT 2 AS id, '3536757630' AS ref ) SELECT id, initref, ref, ( SELECT MAX(num) FROM series WHERE LEFT(initref, num) = LEFT(ref, num) AND num <= LEN(initref) AND num <= LEN(ref) ) AS lcp FROM ( SELECT id, ( SELECT TOP 1 ref FROM data di WHERE di.id = d.id ) AS initref, ref FROM data d ) q
id | initref | ref | lcp |
---|---|---|---|
1 | 3536757616 | 3536757616 | 10 |
1 | 3536757616 | 3536757617 | 9 |
1 | 3536757616 | 3536757618 | 9 |
2 | 3536757628 | 3536757628 | 10 |
2 | 3536757628 | 3536757629 | 9 |
2 | 3536757628 | 3536757630 | 8 |
This query takes first row from each set (it is returned as initref
) and checks it against all rows from the set (including itself).
For each pair, the LCP
length is returned.
Now, we just need to find groupwise minimums of LCP
lengths for the pairs. This will be the LCP
for the whole group:
WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ), data AS ( SELECT 1 AS id, '3536757616' AS ref UNION ALL SELECT 1 AS id, '3536757617' AS ref UNION ALL SELECT 1 AS id, '3536757618' AS ref UNION ALL SELECT 2 AS id, '3536757628' AS ref UNION ALL SELECT 2 AS id, '3536757629' AS ref UNION ALL SELECT 2 AS id, '3536757630' AS ref ) SELECT id, MIN(lcp) FROM ( SELECT id, initref, ref, ( SELECT MAX(num) FROM series WHERE LEFT(initref, num) = LEFT(ref, num) AND num <= LEN(initref) AND num <= LEN(ref) ) AS lcp FROM ( SELECT id, ( SELECT TOP 1 ref FROM data di WHERE di.id = d.id ) AS initref, ref FROM data d ) q ) q2 GROUP BY id
id | |
---|---|
1 | 9 |
2 | 8 |
We just GROUP
'ed the results of the initial query by id
and found the minimum.
Now, we know the length of LCP
for each group.
Everything that is left to do is to cut off the prefix of every but the first string within each group, and concatenate the results.
This is best done in a SELECT
clause subquery using FOR XML
clause:
WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ), data AS ( SELECT 1 AS id, '3536757616' AS ref UNION ALL SELECT 1 AS id, '3536757617' AS ref UNION ALL SELECT 1 AS id, '3536757618' AS ref UNION ALL SELECT 2 AS id, '3536757628' AS ref UNION ALL SELECT 2 AS id, '3536757629' AS ref UNION ALL SELECT 2 AS id, '3536757630' AS ref ) SELECT id, ( SELECT CASE WHEN ROW_NUMBER() OVER (ORDER BY ref) = 1 THEN ref ELSE '/' + RIGHT(ref, LEN(ref) - minlcp) END FROM data dx WHERE id = q3.id FOR XML PATH('') ) AS concat FROM ( SELECT id, MIN(lcp) AS minlcp FROM ( SELECT id, initref, ref, ( SELECT MAX(num) FROM series WHERE LEFT(initref, num) = LEFT(ref, num) AND num <= LEN(initref) AND num <= LEN(ref) ) AS lcp FROM ( SELECT id, ( SELECT TOP 1 ref FROM data di WHERE di.id = d.id ) AS initref, ref FROM data d ) q ) q2 GROUP BY id ) q3
id | concat |
---|---|
1 | 3536757616/7/8 |
2 | 3536757628/29/30 |
Here we are, everything is concatenated nicely.
To be continued.