This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
A quick reminder of the problem taken from Stack Overflow:
I have table item
and another table language
which contains names for the items
in different languages:
How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on two SELECT
list subqueries
- Use
COALESCE
on the results of two LEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOIN
s) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN
for the local language and a subquery for the fallback one.
Now, let's create sample tables and see how these queries behave in PostgreSQL:
Table creation details
CREATE TABLE t_item (id INT NOT NULL, stuffing VARCHAR(100) NOT NULL);
INSERT
INTO t_item
SELECT level AS id, RPAD('Item ' || level, 100, ' ') AS stuffing
FROM generate_series(1, 1000000) level;
ALTER TABLE t_item ADD CONSTRAINT pk_item_id PRIMARY KEY (id);
CREATE TABLE t_language (item INT NOT NULL, language CHAR(3) NOT NULL, name VARCHAR(50) NOT NULL);
INSERT
INTO t_language
SELECT id, CAST(lcode AS CHAR(3)), CAST(lname || ' ' || id AS VARCHAR(50))
FROM (
SELECT 'en' AS lcode, 'English' AS lname
UNION ALL
SELECT 'fr' AS lcode, 'French' AS lname
UNION ALL
SELECT 'la' AS lcode, 'Latin' AS lname
) language
CROSS JOIN
t_item
WHERE NOT (id % 1000 = 1 AND lcode = 'fr')
AND NOT(id % 500 <> 1 AND lcode = 'la');
ALTER TABLE t_language ADD CONSTRAINT pk_language_language_item PRIMARY KEY (language, item);
t_item
contains 1,000,000 items.
t_language
contains 1,000,000 English names, 999,000 French names and 2,000 Latin names.
French is an example of low fallback probability language (good localization), Latin is an example of poor localization.
Two subqueries
Click on the link below to expand the queries.
French
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE
(
(
SELECT name
FROM t_language ln
WHERE item = id
AND language = 'fr'
),
(
SELECT name
FROM t_language ln
WHERE item = id
AND language = 'en'
)
) AS cname
FROM t_item
) q
sum |
12889896 |
1 row fetched in 0.0001s (17.7809s) |
Aggregate (cost=29742.00..29758.99 rows=1 width=4)
-> Seq Scan on t_item (cost=0.00..27242.00 rows=1000000 width=4)
SubPlan 1
-> Index Scan using pk_language_language_item on t_language ln (cost=0.00..8.49 rows=1 width=14)
Index Cond: ((language = 'fr'::bpchar) AND (item = $0))
SubPlan 2
-> Index Scan using pk_language_language_item on t_language ln (cost=0.00..8.49 rows=1 width=14)
Index Cond: ((language = 'en'::bpchar) AND (item = $0))
Latin
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE
(
(
SELECT name
FROM t_language ln
WHERE item = id
AND language = 'la'
),
(
SELECT name
FROM t_language ln
WHERE item = id
AND language = 'en'
)
) AS cname
FROM t_item
) q
sum |
13884896 |
1 row fetched in 0.0000s (30.2963s) |
Aggregate (cost=29742.00..29758.99 rows=1 width=4)
-> Seq Scan on t_item (cost=0.00..27242.00 rows=1000000 width=4)
SubPlan 1
-> Index Scan using pk_language_language_item on t_language ln (cost=0.00..8.49 rows=1 width=14)
Index Cond: ((language = 'la'::bpchar) AND (item = $0))
SubPlan 2
-> Index Scan using pk_language_language_item on t_language ln (cost=0.00..8.49 rows=1 width=14)
Index Cond: ((language = 'en'::bpchar) AND (item = $0))
We see that the Latin query performs more poorly than the French one (30 seconds and 17 seconds, respectively).
Unlike SQL Server, PostgreSQL cannot optimize the first query within the COALESCE
into a JOIN
and runs the subqueries in a loop.
The Latin
query takes two subqueries in majority of cases, that's why it's almost 2 times as slow.
Two LEFT JOIN
s
Click on the link below to expand the queries.
French
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE (ln.name, len.name) AS cname
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'fr'
LEFT JOIN
t_language len
ON len.item = i.id
AND len.language = 'en'
) q
sum |
12889896 |
1 row fetched in 0.0001s (4.1249s) |
Aggregate (cost=355326.06..355326.08 rows=1 width=28)
-> Merge Left Join (cost=274661.74..352826.06 rows=1000000 width=28)
Merge Cond: (i.id = ln.item)
-> Merge Left Join (cost=137812.03..198555.90 rows=1000000 width=18)
Merge Cond: (i.id = len.item)
-> Index Scan using pk_item_id on t_item i (cost=0.00..43225.36 rows=1000000 width=4)
-> Sort (cost=137812.03..140321.28 rows=1003702 width=18)
Sort Key: len.item
-> Seq Scan on t_language len (cost=0.00..37758.50 rows=1003702 width=18)
Filter: (language = 'en'::bpchar)
-> Sort (cost=136849.71..139336.45 rows=994697 width=18)
Sort Key: ln.item
-> Seq Scan on t_language ln (cost=0.00..37758.50 rows=994697 width=18)
Filter: (language = 'fr'::bpchar)
Latin
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE (ln.name, len.name) AS cname
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'la'
LEFT JOIN
t_language len
ON len.item = i.id
AND len.language = 'en'
) q
sum |
13884896 |
1 row fetched in 0.0001s (2.5312s) |
Aggregate (cost=210223.15..210223.16 rows=1 width=28)
-> Merge Left Join (cost=144440.27..207723.15 rows=1000000 width=28)
Merge Cond: (i.id = len.item)
-> Merge Left Join (cost=6628.24..52392.61 rows=1000000 width=18)
Merge Cond: (i.id = ln.item)
-> Index Scan using pk_item_id on t_item i (cost=0.00..43225.36 rows=1000000 width=4)
-> Sort (cost=6628.24..6634.74 rows=2601 width=18)
Sort Key: ln.item
-> Bitmap Heap Scan on t_language ln (cost=52.63..6480.70 rows=2601 width=18)
Recheck Cond: (language = 'la'::bpchar)
-> Bitmap Index Scan on pk_language_language_item (cost=0.00..51.98 rows=2601 width=0)
Index Cond: (language = 'la'::bpchar)
-> Sort (cost=137812.03..140321.28 rows=1003702 width=18)
Sort Key: len.item
-> Seq Scan on t_language len (cost=0.00..37758.50 rows=1003702 width=18)
Filter: (language = 'en'::bpchar)
These queries are highly optimized, and the Latin one is much more efficient than the French one (2.53 seconds vs. 4.12 seconds).
They also yield different execution plans (the order of the JOIN
operations is swapped and Bitmap Index Scan
is applied to the t_language
to retrieve the Latin
names instead of mere Seq Scan
for the French query).
We should note that the index access method are very inefficient in PostgreSQL compared to the other systems. The nested loops used by the previous query are almost 10 times less efficient than the merge access, and sorting is always performed for the latter (though a presorted index exists).
This is probably due to the fact that PostgreSQL does not support clustered indexes, while in SQL Server the primary key indexes are clustered by default. PostgreSQL needs an extra JOIN
on tid
(tuple identifier, a row pointer used by the index) to rerieve the table values. In this case, the index lookup takes much longer and sorting becomes a faster operation even on a large dataset that comprises as many as 1,000,000 records (for the French
and English
names).
It's not clear, though, why PostgreSQL chose bitmap access in the latter case, since it used an Index Scan
anyway, and Bitmap Heap Scan
destroys the index sort order which later needs to be restored with a Sort
. However, on 2,000 records is does not matter much.
One subquery and one join
Click on the link below to expand the queries.
French
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE (
ln.name,
(
SELECT name
FROM t_language len
WHERE len.item = i.id
AND len.language = 'en'
)
) AS cname
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'fr'
) q
sum |
12889896 |
1 row fetched in 0.0001s (2.3750s) |
Aggregate (cost=199995.52..200004.03 rows=1 width=18)
-> Merge Left Join (cost=136849.71..197495.52 rows=1000000 width=18)
Merge Cond: (i.id = ln.item)
-> Index Scan using pk_item_id on t_item i (cost=0.00..43225.36 rows=1000000 width=4)
-> Sort (cost=136849.71..139336.45 rows=994697 width=18)
Sort Key: ln.item
-> Seq Scan on t_language ln (cost=0.00..37758.50 rows=994697 width=18)
Filter: (language = 'fr'::bpchar)
SubPlan 1
-> Index Scan using pk_language_language_item on t_language len (cost=0.00..8.49 rows=1 width=14)
Index Cond: ((language = 'en'::bpchar) AND (item = $0))
Latin
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE (
ln.name,
(
SELECT name
FROM t_language len
WHERE len.item = i.id
AND len.language = 'en'
)
) AS cname
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'la'
) q
sum |
13884896 |
1 row fetched in 0.0001s (19.5465s) |
Aggregate (cost=54892.61..54901.11 rows=1 width=18)
-> Merge Left Join (cost=6628.24..52392.61 rows=1000000 width=18)
Merge Cond: (i.id = ln.item)
-> Index Scan using pk_item_id on t_item i (cost=0.00..43225.36 rows=1000000 width=4)
-> Sort (cost=6628.24..6634.74 rows=2601 width=18)
Sort Key: ln.item
-> Bitmap Heap Scan on t_language ln (cost=52.63..6480.70 rows=2601 width=18)
Recheck Cond: (language = 'la'::bpchar)
-> Bitmap Index Scan on pk_language_language_item (cost=0.00..51.98 rows=2601 width=0)
Index Cond: (language = 'la'::bpchar)
SubPlan 1
-> Index Scan using pk_language_language_item on t_language len (cost=0.00..8.49 rows=1 width=14)
Index Cond: ((language = 'en'::bpchar) AND (item = $0))
As in Oracle and SQL Server, this query is much more efficient for the French names but vastly inefficient for the Latin ones.
This is, of course, also depends of the frequency of the local names being found for an item. The fallback subquery is not evaluated if the local name for an item is found, which makes the subquery evaluate often for the poorly localized Latin names, while for the French names, the fallback subquery is run only in rare cases.
Summary
Like in Oracle and SQL Server, in PostgreSQL there are two different efficient methods to select fallback values:
- If the probability of fallback is high, i. e. there are few main values, it's more efficient to select both the main value and the fallback value in a
LEFT JOIN
, like this:
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE (ln.name, len.name) AS cname
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'la'
LEFT JOIN
t_language len
ON len.item = i.id
AND len.language = 'en'
) q
- If the probability of fallback is low, i. e. there are many main values, it's more efficient to select the main value in a
LEFT JOIN
, and the fallback value in a subquery, which is a second argument to COALESCE
, like this:
SELECT SUM(LENGTH(cname))
FROM (
SELECT COALESCE (
ln.name,
(
SELECT name
FROM t_language len
WHERE len.item = i.id
AND len.language = 'en'
)
) AS cname
FROM t_item i
LEFT JOIN
t_language ln
ON ln.item = i.id
AND ln.language = 'fr'
) q
To be continued.