EXPLAIN EXTENDED

How to create fast database queries

Article-aware title filtering: internationalization

Comments enabled. I *really* need your comment

In the previous article, I described a query that searches for a phrase beginning with a certain letter or string, ignoring a leading article if any. This is useful in searching for movie titles.

Today, I'll add some more features to this query.

Let's pretend we are running an internationalized website which contains movie titles in many languages. We have a table showing us whether a certain word in the beginning of a title is an article, particle or any other part of speech that should be omitted when filtering:

SELECT  *
FROM    t_article
art_id art_article
1 A
2 An
8 Das
6 Der
7 Die
4 Ein
5 Eine
17 Gli
15 I
12 Il
13 La
16 Le
14 Lo
3 The
9 Un
11 Una
10 Uno
17 rows fetched in 0.0004s (0.0024s)

To select all titles beginning with a certain string, we can use the same approach we used when the articles were hardcoded:

SELECT  mov_title
FROM    (
        SELECT  CONCAT(art_article, ' ') AS article
        FROM    t_article
        UNION ALL
        SELECT  ''
        ) articles, t_movie
WHERE   mov_title LIKE CONCAT(article, 'Can%')
mov_title
Eine Canelaqo
Canypi
Gli Canijo
Canohywoqi
Die Canejebuhu
Canoly
Gli Canilemi
An Cany
Cane
Un Caniju
Cano
Canujecoloxe
A Canekajo
Canujidy
Canimo
The Cana
Canahiziwy
I Canukohe
Die Canymi
A Canihytabu
Uno Canikaha
Cani
22 rows fetched in 0.0004s (1.3437s)

But for some reason it doesn't work as efficiently anymore.

This is because MySQL cannot use range access method in this case.

MySQL documentation states that range access can be used only when the expression used in a comparison is a constant, i. e. one of the following:

  • A constant from the query string
  • A column of a const or system table from the same join
  • The result of an uncorrelated subquery
  • Any expression composed entirely from subexpressions of the preceding types

Unfortunately, the expression we use in comparison doesn't fall under any of these categories.

For the range access method to be usable, we need to use <= and >= comparisons in the where clause. We need to find all titles that lie between the initial (that is, the searched for string prepended with an article), and some greatest value of those beginning with the initial.

In case of integers, each value has at most one unique predecessor and at most one unique successor, which we can calculate by adding or substracting a 1 to the value.

In case of strings, it's not so easy: strings are not well-ordered, that's why we cannot calculate the greatest element beginning with the initial: such an element just does not exists. Even calculating the least element which is greater than the initial and not LIKE it, may be a complex task that should take into account case and accent sensitivity.

That's why we'll just take this least unlike element from the table itself.

We will need all movie titles that lie between initial and the first value in the string order that is NOT LIKE initial:

SELECT  mov_title
FROM    (
        SELECT  'Can' AS filter
        ) vars,
        (
        SELECT  CONCAT(art_article, ' ') AS article
        FROM    t_article
        UNION ALL
        SELECT  ''
        ) articles, t_movie
WHERE   mov_title >= CONCAT(article, filter)
        AND mov_title <
        (
        SELECT  mov_title
        FROM    t_movie
        WHERE   mov_title >= CONCAT(article, filter)
                AND mov_title NOT LIKE CONCAT(article, filter, '%')
        LIMIT 1
        )
mov_title
A Canekajo
A Canihytabu
An Cany
Die Canejebuhu
Die Canymi
Eine Canelaqo
Gli Canijo
Gli Canilemi
I Canukohe
The Cana
Un Caniju
Uno Canikaha
Canahiziwy
Cane
Cani
Canimo
Cano
Canohywoqi
Canoly
Canujecoloxe
Canujidy
Canypi
22 rows fetched in 0.0005s (3.1718s)

This runs even longer. This is not so surprising, as this is a well-known bug in MySQL that does not let to use range access method in dependent subqueries.

To work around this bug, we will need to create a user-defined function that returns this least unlike element:

CREATE FUNCTION fn_get_least_unlike(initial VARCHAR(200)) RETURNS VARCHAR(200)
NOT DETERMINISTIC
READS SQL DATA
BEGIN
        DECLARE _title VARCHAR(200);
        DECLARE EXIT HANDLER FOR NOT FOUND RETURN UNHEX('FFFF');
        SELECT  mov_title
        INTO    _title
        FROM    t_movie
        WHERE   mov_title >= initial
                AND mov_title NOT LIKE CONCAT(initial, '%')
        ORDER BY
                mov_title
        LIMIT 1;
        RETURN _title;
END

, and use it in the query:

SELECT mov_title
FROM (
SELECT 'Can' AS filter
) vars,
(
SELECT CONCAT(art_article, ' ') AS article
FROM t_article
UNION ALL
SELECT ''
) articles, t_movie
WHERE mov_title >= CONCAT(article, filter)
AND mov_title < fn_get_least_unlike(CONCAT(article, filter)) [/sourcecode]

mov_title
A Canekajo
A Canihytabu
An Cany
Die Canejebuhu
Die Canymi
Eine Canelaqo
Gli Canijo
Gli Canilemi
I Canukohe
The Cana
Un Caniju
Uno Canikaha
Canahiziwy
Cane
Cani
Canimo
Cano
Canohywoqi
Canoly
Canujecoloxe
Canujidy
Canypi
22 rows fetched in 0.0004s (0.0109s)

Finally, it works as fast as it should.

Written by Quassnoi

March 24th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply