This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
A quick reminder of the problem taken from Stack Overflow:
I have table item
and another table language
which contains names for the items
in different languages:
How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on two SELECT
list subqueries
- Use
COALESCE
on the results of two LEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In Oracle, SQL Server and PostgreSQL, the second method (two LEFT JOIN
s) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN
for the local language and a subquery for the fallback one.
To gain efficiency, all these systems used some kind of a join method that performs better on the large portions of the rowset, i. e. HASH JOIN
or MERGE JOIN
.
MySQL, however, is only capable of doing nested loops, so MySQL's performance should differ from one of the engines tested earlier.
Let's create sample tables and see:
Table creation scripts
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE t_item (
id INT NOT NULL PRIMARY KEY,
stuffing VARCHAR(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
CREATE TABLE t_language (
item INT NOT NULL,
language CHAR(3) NOT NULL,
name VARCHAR(50) NOT NULL,
PRIMARY KEY pk_language_language_item (language, item)
);
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;
INSERT
INTO t_item
SELECT id, RPAD(CONCAT('Item ', id), 100, ' ') AS stuffing
FROM filler;
INSERT
INTO t_language
SELECT id, lcode, CONCAT_WS(' ', lname, id)
FROM (
SELECT 'en' AS lcode, 'English' AS lname
UNION ALL
SELECT 'fr' AS lcode, 'French' AS lname
UNION ALL
SELECT 'la' AS lcode, 'Latin' AS lname
) language
CROSS JOIN
t_item
WHERE NOT (id % 1000 = 1 AND lcode = 'fr')
AND NOT(id % 500 <> 1 AND lcode = 'la');
t_item
contains 1,000,000 items.
t_language
contains 1,000,000 English names, 999,000 French names and 2,000 Latin names.
French is an example of low fallback probability language (good localization), Latin is an example of poor localization.
Two subqueries
Click below to see details for French and Latin queries:
French
SELECT SUM(
LENGTH(
COALESCE
(
(
SELECT name
FROM t_language ln
WHERE item = id
AND language = 'fr'
),
(
SELECT name
FROM t_language len
WHERE item = id
AND language = 'en'
)
)
)
) AS sname
FROM t_item
sname |
12889896 |
1 row fetched in 0.0002s (7.0937s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
t_item |
index |
|
PRIMARY |
4 |
|
1000187 |
100.00 |
Using index |
3 |
DEPENDENT SUBQUERY |
len |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.t_item.id |
1 |
100.00 |
Using where |
2 |
DEPENDENT SUBQUERY |
ln |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.t_item.id |
1 |
100.00 |
Using where |
Latin
SELECT SUM(
LENGTH(
COALESCE
(
(
SELECT name
FROM t_language ln
WHERE item = id
AND language = 'la'
),
(
SELECT name
FROM t_language len
WHERE item = id
AND language = 'en'
)
)
)
) AS sname
FROM t_item
sname |
13884896 |
1 row fetched in 0.0001s (26.6872s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
t_item |
index |
|
PRIMARY |
4 |
|
1000187 |
100.00 |
Using index |
3 |
DEPENDENT SUBQUERY |
len |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.t_item.id |
1 |
100.00 |
Using where |
2 |
DEPENDENT SUBQUERY |
ln |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.t_item.id |
1 |
100.00 |
Using where |
We see that the Latin query is much less efficient than the French one (26.6 seconds against 7 seconds).
This is because in MySQL using unbalanced composite keys, key misses are much less efficient than key hits.
Two LEFT JOINs
Click below to see details for French and Latin queries:
French
SELECT SUM(LENGTH(COALESCE(ln.name, len.name)))
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'fr'
LEFT JOIN
t_language len
ON len.item = i.id
AND len.language = 'en'
SUM(LENGTH(COALESCE(ln.name, len.name))) |
12889896 |
1 row fetched in 0.0001s (10.2967s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
SIMPLE |
i |
index |
|
PRIMARY |
4 |
|
1000187 |
100.00 |
Using index |
1 |
SIMPLE |
ln |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
|
1 |
SIMPLE |
len |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
|
Latin
SELECT SUM(LENGTH(COALESCE(ln.name, len.name)))
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'la'
LEFT JOIN
t_language len
ON len.item = i.id
AND len.language = 'en'
SUM(LENGTH(COALESCE(ln.name, len.name))) |
13884896 |
1 row fetched in 0.0001s (22.1560s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
SIMPLE |
i |
index |
|
PRIMARY |
4 |
|
1000187 |
100.00 |
Using index |
1 |
SIMPLE |
ln |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
|
1 |
SIMPLE |
len |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
|
We see that unlike other systems, the French query gets even worse if we replace subqueries with JOINs. The Latin query gets a little more efficient.
This is because LEFT JOIN
in MySQL behaves just like a subquery, except that it always evaluates both tables. MySQL always uses nested loops and no HASH JOIN
or MERGE JOIN
improvements are made.
The French query, however, is still more efficient than the Latin one.
One subquery and one join
Click below to see details for French and Latin queries:
French
SELECT SUM(
LENGTH(
COALESCE (
ln.name,
(
SELECT name
FROM t_language len
WHERE len.item = i.id
AND len.language = 'en'
)
)
)
)
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'fr'
cname |
12889896 |
1 row fetched in 0.0003s (4.7968s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
i |
index |
|
PRIMARY |
4 |
|
1000187 |
100.00 |
Using index |
1 |
PRIMARY |
ln |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
|
2 |
DEPENDENT SUBQUERY |
len |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
Using where |
Latin
SELECT SUM(
LENGTH(
COALESCE (
ln.name,
(
SELECT name
FROM t_language len
WHERE len.item = i.id
AND len.language = 'en'
)
)
)
)
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'la'
cname |
13884896 |
1 row fetched in 0.0004s (25.3591s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
i |
index |
|
PRIMARY |
4 |
|
1000187 |
100.00 |
Using index |
1 |
PRIMARY |
ln |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
|
2 |
DEPENDENT SUBQUERY |
len |
eq_ref |
PRIMARY |
PRIMARY |
13 |
const,20090810_fallback.i.id |
1 |
100.00 |
Using where |
The French query is more efficient in this case (4.85 seconds).
Latin one, on the other hand, has same efficiency.
Summary
Since MySQL uses only one join method, namely Nested Loops
, and key misses are less efficient than key hits, fallback probability has a reverse effect: high fallback queries are less performant than low fallback ones.
Two LEFT JOIN
(which are more efficient for the French query in Oracle, SQL Server and PostgreSQL) are in all cases less efficient than a LEFT JOIN
and a subquery.