EXPLAIN EXTENDED

How to create fast database queries

Longest common prefix: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

I've already encoutered this problem several times, so I'll try to cover solutions for all RDBMS's my blog is about:

I hope this will be interesting as approaches will differ significantly for all four systems.

Today is SQL Server time.

I won't create sample tables here, since I'm demonstrating the principle. Instead, I'll just use dynamically generated data.

What do we need to do here is:

  1. Find least common prefix for each group and its length
  2. Cut off the prefix of each but the first ref, using SUBSTRING
  3. Concatenate the strings using FOR XML

Steps 2 and 3 are quite simple, but the first one needs some effort.

SQL Server doesn't provide a function to find the longest common prefix (LCP), so we'll have to implement it.

With some limitations, it may be done using pure SQL, no UDF's.

It's easily provable that the LCP for a set of strings is the shortest of the LCP's for any given string from the set and all other strings.

I. e., if we take a single string from the set, find its LCP's for all other strings from the set, and take the minimal one, this will be LCP of the whole set.

So we need to select a single ref out of each group, compare it to all other ref's from the same group, and find the shortest LCP.

But how do we find an LCP for a pair of strings?

We could easily do it in a loop in any procedural language, but SQL is not procedural. However, we can implement it using pure set operations.

We'll perform the following steps:

  1. Generate a set of integers from 1 to 100 using a CTE
  2. Take the prefixes of both strings of corresponding length, using LEFT
  3. Compare the prefixes and filter the row out if they don't match
  4. Return the maximal length of prefixes that matched

Here's the query to do that:

WITH    series (num) AS
(
SELECT  1
UNION ALL
SELECT  num + 1
FROM    series
WHERE   num <= 100
        )
SELECT  MAX(num)
FROM    series
WHERE   LEFT('abcDEF', num) = LEFT('abcGHI', num)

3

This query returns the greatest num for which LEFT(num) of both strings are equal.

Now, we need to take and arbitrary string from each group, pair it with all other strings from the same group, and find LCP's for each pair:

WITH    series (num) AS
(
SELECT  1
UNION ALL
SELECT  num + 1
FROM    series
WHERE   num <= 100
        ),
        data AS
        (
        SELECT  1 AS id, '3536757616' AS ref
        UNION ALL
        SELECT  1 AS id, '3536757617' AS ref
        UNION ALL
        SELECT  1 AS id, '3536757618' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757628' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757629' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757630' AS ref
        )
SELECT  id, initref, ref,
        (
        SELECT  MAX(num)
        FROM    series
        WHERE   LEFT(initref, num) = LEFT(ref, num)
                AND num <= LEN(initref)
                AND num <= LEN(ref)
        ) AS lcp
FROM    (
        SELECT  id,
                (
                SELECT  TOP 1 ref
                FROM    data di
                WHERE   di.id = d.id
                ) AS initref,
                ref
        FROM    data d
        ) q

id initref ref lcp
1 3536757616 3536757616 10
1 3536757616 3536757617 9
1 3536757616 3536757618 9
2 3536757628 3536757628 10
2 3536757628 3536757629 9
2 3536757628 3536757630 8

This query takes first row from each set (it is returned as initref) and checks it against all rows from the set (including itself).

For each pair, the LCP length is returned.

Now, we just need to find groupwise minimums of LCP lengths for the pairs. This will be the LCP for the whole group:

WITH    series (num) AS
(
SELECT  1
UNION ALL
SELECT  num + 1
FROM    series
WHERE   num <= 100
        ),
        data AS
        (
        SELECT  1 AS id, '3536757616' AS ref
        UNION ALL
        SELECT  1 AS id, '3536757617' AS ref
        UNION ALL
        SELECT  1 AS id, '3536757618' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757628' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757629' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757630' AS ref
        )
SELECT  id, MIN(lcp)
FROM    (
        SELECT  id, initref, ref,
                (
                SELECT  MAX(num)
                FROM    series
                WHERE   LEFT(initref, num) = LEFT(ref, num)
                        AND num <= LEN(initref)
                        AND num <= LEN(ref)
                ) AS lcp
        FROM    (
                SELECT  id,
                        (
                        SELECT  TOP 1 ref
                        FROM    data di
                        WHERE   di.id = d.id
                        ) AS initref,
                        ref
                FROM    data d
                ) q
        ) q2
GROUP BY
        id

id
1 9
2 8

We just GROUP'ed the results of the initial query by id and found the minimum.

Now, we know the length of LCP for each group.

Everything that is left to do is to cut off the prefix of every but the first string within each group, and concatenate the results.

This is best done in a SELECT clause subquery using FOR XML clause:

WITH    series (num) AS
(
SELECT  1
UNION ALL
SELECT  num + 1
FROM    series
WHERE   num <= 100
        ),
        data AS
        (
        SELECT  1 AS id, '3536757616' AS ref
        UNION ALL
        SELECT  1 AS id, '3536757617' AS ref
        UNION ALL
        SELECT  1 AS id, '3536757618' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757628' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757629' AS ref
        UNION ALL
        SELECT  2 AS id, '3536757630' AS ref
        )
SELECT  id,
        (
        SELECT  CASE WHEN ROW_NUMBER() OVER (ORDER BY ref) = 1 THEN ref ELSE '/' + RIGHT(ref, LEN(ref) - minlcp) END
        FROM    data dx
        WHERE   id = q3.id
        FOR XML PATH('')
        ) AS concat
FROM    (
        SELECT  id, MIN(lcp) AS minlcp
        FROM    (
                SELECT  id, initref, ref,
                        (
                        SELECT  MAX(num)
                        FROM    series
                        WHERE   LEFT(initref, num) = LEFT(ref, num)
                                AND num <= LEN(initref)
                                AND num <= LEN(ref)
                        ) AS lcp
                FROM    (
                        SELECT  id,
                                (
                                SELECT  TOP 1 ref
                                FROM    data di
                                WHERE   di.id = d.id
                                ) AS initref,
                                ref
                        FROM    data d
                        ) q
                ) q2
        GROUP BY
                id
        ) q3
id concat
1 3536757616/7/8
2 3536757628/29/30

Here we are, everything is concatenated nicely.

To be continued.

Written by Quassnoi

June 4th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply