Fallback language names: MySQL at EXPLAIN EXTENDED

Fallback language names: MySQL

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

Use COALESCE on two SELECT list subqueries
Use COALESCE on the results of two LEFT JOINS
Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In Oracle, SQL Server and PostgreSQL, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

To gain efficiency, all these systems used some kind of a join method that performs better on the large portions of the rowset, i. e. HASH JOIN or MERGE JOIN.

MySQL, however, is only capable of doing nested loops, so MySQL's performance should differ from one of the engines tested earlier.

Let's create sample tables and see:

Table creation scripts

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_item (
        id INT NOT NULL PRIMARY KEY,
        stuffing VARCHAR(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;


CREATE TABLE t_language (
        item INT NOT NULL,
        language CHAR(3) NOT NULL,
        name VARCHAR(50) NOT NULL,
        PRIMARY KEY pk_language_language_item (language, item)
);

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_item
SELECT  id, RPAD(CONCAT('Item ', id), 100, ' ') AS stuffing
FROM    filler;

INSERT
INTO    t_language
SELECT  id, lcode, CONCAT_WS(' ', lname, id)
FROM    (
        SELECT  'en' AS lcode, 'English' AS lname
        UNION ALL
        SELECT  'fr' AS lcode, 'French' AS lname
        UNION ALL
        SELECT  'la' AS lcode, 'Latin' AS lname
        ) language
CROSS JOIN
        t_item
WHERE   NOT (id % 1000 = 1 AND lcode = 'fr')
        AND NOT(id % 500 <> 1 AND lcode = 'la');

t_item contains 1,000,000 items.

t_language contains 1,000,000 English names, 999,000 French names and 2,000 Latin names.

French is an example of low fallback probability language (good localization), Latin is an example of poor localization.

Two subqueries

Click below to see details for French and Latin queries:

French

SELECT  SUM(
        LENGTH(
        COALESCE
        (
        (
        SELECT  name
        FROM    t_language ln
        WHERE   item = id
                AND language = 'fr'
        ),
        (
        SELECT  name
        FROM    t_language len
        WHERE   item = id
                AND language = 'en'
        )
        )
        )
        ) AS sname
FROM    t_item

sname
12889896
1 row fetched in 0.0002s (7.0937s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t_item	index		PRIMARY	4		1000187	100.00	Using index
3	DEPENDENT SUBQUERY	len	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.t_item.id	1	100.00	Using where
2	DEPENDENT SUBQUERY	ln	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.t_item.id	1	100.00	Using where

Latin

SELECT  SUM(
        LENGTH(
        COALESCE
        (
        (
        SELECT  name
        FROM    t_language ln
        WHERE   item = id
                AND language = 'la'
        ),
        (
        SELECT  name
        FROM    t_language len
        WHERE   item = id
                AND language = 'en'
        )
        )
        )
        ) AS sname
FROM    t_item

sname
13884896
1 row fetched in 0.0001s (26.6872s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t_item	index		PRIMARY	4		1000187	100.00	Using index
3	DEPENDENT SUBQUERY	len	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.t_item.id	1	100.00	Using where
2	DEPENDENT SUBQUERY	ln	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.t_item.id	1	100.00	Using where

We see that the Latin query is much less efficient than the French one (26.6 seconds against 7 seconds).

This is because in MySQL using unbalanced composite keys, key misses are much less efficient than key hits.

Two LEFT JOINs

Click below to see details for French and Latin queries:

French

SELECT  SUM(LENGTH(COALESCE(ln.name, len.name)))
FROM    t_item i
LEFT JOIN
        t_language ln
ON      ln.item = i.id
        AND ln.language = 'fr'
LEFT JOIN
        t_language len
ON      len.item = i.id
        AND len.language = 'en'

SUM(LENGTH(COALESCE(ln.name, len.name)))
12889896
1 row fetched in 0.0001s (10.2967s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	i	index		PRIMARY	4		1000187	100.00	Using index
1	SIMPLE	ln	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00
1	SIMPLE	len	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00

Latin

SELECT  SUM(LENGTH(COALESCE(ln.name, len.name)))
FROM    t_item i
LEFT JOIN
        t_language ln
ON      ln.item = i.id
        AND ln.language = 'la'
LEFT JOIN
        t_language len
ON      len.item = i.id
        AND len.language = 'en'

SUM(LENGTH(COALESCE(ln.name, len.name)))
13884896
1 row fetched in 0.0001s (22.1560s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	i	index		PRIMARY	4		1000187	100.00	Using index
1	SIMPLE	ln	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00
1	SIMPLE	len	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00

We see that unlike other systems, the French query gets even worse if we replace subqueries with JOINs. The Latin query gets a little more efficient.

This is because LEFT JOIN in MySQL behaves just like a subquery, except that it always evaluates both tables. MySQL always uses nested loops and no HASH JOIN or MERGE JOIN improvements are made.

The French query, however, is still more efficient than the Latin one.

One subquery and one join

Click below to see details for French and Latin queries:

French

SELECT  SUM(
        LENGTH(
        COALESCE (
        ln.name,
        (
        SELECT  name
        FROM    t_language len
        WHERE   len.item = i.id
                AND len.language = 'en'

        )
        )
        )
        )
FROM    t_item i
LEFT JOIN
        t_language ln
ON      ln.item = i.id
        AND ln.language = 'fr'

cname
12889896
1 row fetched in 0.0003s (4.7968s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	i	index		PRIMARY	4		1000187	100.00	Using index
1	PRIMARY	ln	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00
2	DEPENDENT SUBQUERY	len	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00	Using where

Latin

SELECT  SUM(
        LENGTH(
        COALESCE (
        ln.name,
        (
        SELECT  name
        FROM    t_language len
        WHERE   len.item = i.id
                AND len.language = 'en'

        )
        )
        )
        )
FROM    t_item i
LEFT JOIN
        t_language ln
ON      ln.item = i.id
        AND ln.language = 'la'

cname
13884896
1 row fetched in 0.0004s (25.3591s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	i	index		PRIMARY	4		1000187	100.00	Using index
1	PRIMARY	ln	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00
2	DEPENDENT SUBQUERY	len	eq_ref	PRIMARY	PRIMARY	13	const,20090810_fallback.i.id	1	100.00	Using where

The French query is more efficient in this case (4.85 seconds).

Latin one, on the other hand, has same efficiency.

Summary

Since MySQL uses only one join method, namely Nested Loops, and key misses are less efficient than key hits, fallback probability has a reverse effect: high fallback queries are less performant than low fallback ones.

Two LEFT JOIN (which are more efficient for the French query in Oracle, SQL Server and PostgreSQL) are in all cases less efficient than a LEFT JOIN and a subquery.

Written by Quassnoi

August 10th, 2009 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED