## Longest common prefix: SQL Server

Comments enabled. I *really* need your comment

From **Stack Overflow**:

I have some data:

id ref 1 3536757616 1 3536757617 1 3536757618 2 3536757628 2 3536757629 2 3536757630 and want to get the result like this:

id result 1 3536757616/7/8 2 3536757629/28/30 Essentially, the data should be aggregated on

`id`

, and the`ref`

's should be concatenated together and separated by a`/`

(slash), but with longest common prefix removed.

I've already encoutered this problem several times, so I'll try to cover solutions for all **RDBMS**'s my blog is about:

**Longest common prefix: SQL Server****Longest common prefix: PostgreSQL****Longest common prefix: Oracle****Longest common prefix: MySQL**

I hope this will be interesting as approaches will differ significantly for all four systems.

Today is **SQL Server** time.

I won't create sample tables here, since I'm demonstrating the principle. Instead, I'll just use dynamically generated data.

What do we need to do here is:

- Find least common prefix for each group and its length
- Cut off the prefix of each but the first
`ref`

, using`SUBSTRING`

- Concatenate the strings using
`FOR XML`

Steps **2** and **3** are quite simple, but the first one needs some effort.

**SQL Server** doesn't provide a function to find the longest common prefix (**LCP**), so we'll have to implement it.

With some limitations, it may be done using pure `SQL`

, no `UDF`

's.

It's easily provable that the **LCP** for a set of strings is the shortest of the **LCP**'s for any given string from the set and all other strings.

I. e., if we take a single string from the set, find its **LCP**'s for all other strings from the set, and take the minimal one, this will be **LCP** of the whole set.

So we need to select a single `ref`

out of each group, compare it to all other `ref`

's from the same group, and find the shortest `LCP`

.

But how do we find an `LCP`

for a pair of strings?

We could easily do it in a loop in any procedural language, but `SQL`

is not procedural. However, we can implement it using pure set operations.

We'll perform the following steps:

- Generate a set of integers from
**1**to**100**using a**CTE** - Take the prefixes of both strings of corresponding length, using
`LEFT`

- Compare the prefixes and filter the row out if they don't match
- Return the maximal length of prefixes that matched

Here's the query to do that:

WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ) SELECT MAX(num) FROM series WHERE LEFT('abcDEF', num) = LEFT('abcGHI', num)

3 |

This query returns the greatest `num`

for which `LEFT(num)`

of both strings are equal.

Now, we need to take and arbitrary string from each group, pair it with all other strings from the same group, and find **LCP**'s for each pair:

WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ), data AS ( SELECT 1 AS id, '3536757616' AS ref UNION ALL SELECT 1 AS id, '3536757617' AS ref UNION ALL SELECT 1 AS id, '3536757618' AS ref UNION ALL SELECT 2 AS id, '3536757628' AS ref UNION ALL SELECT 2 AS id, '3536757629' AS ref UNION ALL SELECT 2 AS id, '3536757630' AS ref ) SELECT id, initref, ref, ( SELECT MAX(num) FROM series WHERE LEFT(initref, num) = LEFT(ref, num) AND num <= LEN(initref) AND num <= LEN(ref) ) AS lcp FROM ( SELECT id, ( SELECT TOP 1 ref FROM data di WHERE di.id = d.id ) AS initref, ref FROM data d ) q

id | initref | ref | lcp |
---|---|---|---|

1 | 3536757616 | 3536757616 | 10 |

1 | 3536757616 | 3536757617 | 9 |

1 | 3536757616 | 3536757618 | 9 |

2 | 3536757628 | 3536757628 | 10 |

2 | 3536757628 | 3536757629 | 9 |

2 | 3536757628 | 3536757630 | 8 |

This query takes first row from each set (it is returned as `initref`

) and checks it against all rows from the set (including itself).

For each pair, the `LCP`

length is returned.

Now, we just need to find groupwise minimums of `LCP`

lengths for the pairs. This will be the `LCP`

for the whole group:

WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ), data AS ( SELECT 1 AS id, '3536757616' AS ref UNION ALL SELECT 1 AS id, '3536757617' AS ref UNION ALL SELECT 1 AS id, '3536757618' AS ref UNION ALL SELECT 2 AS id, '3536757628' AS ref UNION ALL SELECT 2 AS id, '3536757629' AS ref UNION ALL SELECT 2 AS id, '3536757630' AS ref ) SELECT id, MIN(lcp) FROM ( SELECT id, initref, ref, ( SELECT MAX(num) FROM series WHERE LEFT(initref, num) = LEFT(ref, num) AND num <= LEN(initref) AND num <= LEN(ref) ) AS lcp FROM ( SELECT id, ( SELECT TOP 1 ref FROM data di WHERE di.id = d.id ) AS initref, ref FROM data d ) q ) q2 GROUP BY id

id | |
---|---|

1 | 9 |

2 | 8 |

We just `GROUP`

'ed the results of the initial query by `id`

and found the minimum.

Now, we know the length of `LCP`

for each group.

Everything that is left to do is to cut off the prefix of every but the first string within each group, and concatenate the results.

This is best done in a `SELECT`

clause subquery using `FOR XML`

clause:

WITH series (num) AS ( SELECT 1 UNION ALL SELECT num + 1 FROM series WHERE num <= 100 ), data AS ( SELECT 1 AS id, '3536757616' AS ref UNION ALL SELECT 1 AS id, '3536757617' AS ref UNION ALL SELECT 1 AS id, '3536757618' AS ref UNION ALL SELECT 2 AS id, '3536757628' AS ref UNION ALL SELECT 2 AS id, '3536757629' AS ref UNION ALL SELECT 2 AS id, '3536757630' AS ref ) SELECT id, ( SELECT CASE WHEN ROW_NUMBER() OVER (ORDER BY ref) = 1 THEN ref ELSE '/' + RIGHT(ref, LEN(ref) - minlcp) END FROM data dx WHERE id = q3.id FOR XML PATH('') ) AS concat FROM ( SELECT id, MIN(lcp) AS minlcp FROM ( SELECT id, initref, ref, ( SELECT MAX(num) FROM series WHERE LEFT(initref, num) = LEFT(ref, num) AND num <= LEN(initref) AND num <= LEN(ref) ) AS lcp FROM ( SELECT id, ( SELECT TOP 1 ref FROM data di WHERE di.id = d.id ) AS initref, ref FROM data d ) q ) q2 GROUP BY id ) q3

id | concat |
---|---|

1 | 3536757616/7/8 |

2 | 3536757628/29/30 |

Here we are, everything is concatenated nicely.

**To be continued.**