EXPLAIN EXTENDED

How to create fast database queries

Longest common prefix: MySQL

Comments enabled. I *really* need your comment

This is the fourth and final article of the series on how to strip all strings in a set of their longest common prefix and concatenate the results:

, and today we will solve this task for MySQL.

A quick reminder of the problem (taken from Stack Overflow):

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

MySQL solution differs from the previous ones a little.

On the one hand, MySQL is the only RDBMS of all listed above that supplies a built-in GROUP_CONCAT (aggregate function to concatenate strings).

On the other hand, MySQL lacks a way to generate an arbitrary resultset which we relied upon in all solutions above.

This makes it easier to concatenate the results, but a lot harder to find the longest common prefix.

Fortunately, there is a way to do it.

Since 5.1, MySQL offers an XPath manipulation function, namely ExtractValue, that will help us to solve this task.

Unfortunately, ExtractValue is not set generating, i. e. it cannot create an SQL resultset out of an XML nodeset. But XPath itself is quite a powerful tool, and we can implement the longest common prefix algorithm completely in XPath.

First, for each ref in the data above, we should generate an XML nodeset that will contain as many nodes as there are characters in the given ref.

It should look like this:

<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>
<e>3536757616</e>

There are 10 characters in the string value, and 10 nodes in the generated XML.

We can do it using MySQL's LPAD function:

SELECT  ref, LPAD('', LENGTH(ref) * LENGTH(node), node) AS xml
FROM    (
SELECT  ref, CONCAT('<e>', ref, '</e>') AS node
FROM    t_data a
) q

ref xml
3536757616 <e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e>
3536757617 <e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e>
3536757618 <e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e>
3536757628 <e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e>
3536757629 <e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e>
3536757630 <e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e>
6 rows fetched in 0.0009s (0.0014s)

Now, we should pick a ref for each id and find the lengths of the LCP for this initial ref and all other ref's within this id.

To do this, we will first select all XML nodes whose substring(1, position()) matches one for the initref, and second, just count these ref's with XPath's count function.

This is done like this:

SELECT  id,
CONCAT('<i>', initref, '</i>', LPAD('', LENGTH(ref) * LENGTH(node), node)) AS xml,
CAST(ExtractValue(CONCAT('<i>', initref, '</i>', LPAD('', LENGTH(ref) * LENGTH(node), node)), 'count(e[substring(., 1, position()) = substring(/i, 1, position())])') AS DECIMAL) AS lcp
FROM    (
SELECT  id, ref, CONCAT('<e>', ref, '</e>') AS node,
(
SELECT  ref
FROM    t_data ai
WHERE   ai.id = a.id
LIMIT 1
) AS initref
FROM    t_data a
) q

id xml lcp
1 <i>3536757616</i><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e><e>3536757616</e> 10
1 <i>3536757616</i><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e><e>3536757617</e> 9
1 <i>3536757616</i><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e><e>3536757618</e> 9
2 <i>3536757628</i><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e><e>3536757628</e> 10
2 <i>3536757628</i><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e><e>3536757629</e> 9
2 <i>3536757628</i><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e><e>3536757630</e> 8
6 rows fetched in 0.0010s (0.0022s)

The following XPath expression:

count(e[substring(., 1, position()) = substring(/i, 1, position())])

returns count of those nodes for which the prefixes match those of initref.

The prefix lengths are incremented one char per node, that's why this value is in fact the length of the LCP for two strings.

Now, we have the lengths of the LCP's of the first ref and all other ref's within each id.

This is enough to calculate common LCP for an id (which is the minimal of all LCP's for this id), strip the ref's of this LCP and concatenate the results.

Here's the resulting query:

SELECT  id,
(
SELECT  CONCAT(
SUBSTR(MIN(ref), 1, lcp),
GROUP_CONCAT(SUBSTR(ref, lcp + 1) SEPARATOR '/')
)
FROM    t_data ai
WHERE   ai.id = q3.id
) AS result

FROM    (
SELECT  id, MIN(lcp) AS lcp
FROM    (
SELECT  id,
CAST(ExtractValue(CONCAT('<i>', initref, '</i>', LPAD('', LENGTH(ref) * LENGTH(node), node)), 'count(e[substring(., 1, position()) = substring(/i, 1, position())])') AS DECIMAL) AS lcp
FROM    (
SELECT  id, ref, CONCAT('<e>', ref, '</e>') AS node,
(
SELECT  ref
FROM    t_data ai
WHERE   ai.id = a.id
LIMIT 1
) AS initref
FROM    t_data a
) q
) q2
GROUP BY
id
) q3
id result
1 3536757616/7/8
2 3536757628/29/30
2 rows fetched in 0.0001s (0.0024s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived3> ALL 2 100.00
3 DERIVED <derived4> ALL 6 100.00 Using temporary; Using filesort
4 DERIVED <derived5> ALL 6 100.00
5 DERIVED a ALL 6 100.00
6 DEPENDENT SUBQUERY ai ALL 6 100.00 Using where
2 DEPENDENT SUBQUERY ai ALL 6 100.00 Using where

This solution has a drawback of being prone to XML injection, since the values are inserted into XML as is.

This, however, can be easily worked around by XML escaping the characters, which I didn't do here for the sake of brevity.

Written by Quassnoi

June 7th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply