EXPLAIN EXTENDED

How to create fast database queries

Fallback language names: PostgreSQL

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

Now, let's create sample tables and see how these queries behave in PostgreSQL:

Table creation details

t_item contains 1,000,000 items.

t_language contains 1,000,000 English names, 999,000 French names and 2,000 Latin names.

French is an example of low fallback probability language (good localization), Latin is an example of poor localization.

Two subqueries


Click on the link below to expand the queries.

French

Latin

We see that the Latin query performs more poorly than the French one (30 seconds and 17 seconds, respectively).

Unlike SQL Server, PostgreSQL cannot optimize the first query within the COALESCE into a JOIN and runs the subqueries in a loop.

The Latin query takes two subqueries in majority of cases, that's why it's almost 2 times as slow.

Two LEFT JOINs


Click on the link below to expand the queries.

French

Latin

These queries are highly optimized, and the Latin one is much more efficient than the French one (2.53 seconds vs. 4.12 seconds).

They also yield different execution plans (the order of the JOIN operations is swapped and Bitmap Index Scan is applied to the t_language to retrieve the Latin names instead of mere Seq Scan for the French query).

We should note that the index access method are very inefficient in PostgreSQL compared to the other systems. The nested loops used by the previous query are almost 10 times less efficient than the merge access, and sorting is always performed for the latter (though a presorted index exists).

This is probably due to the fact that PostgreSQL does not support clustered indexes, while in SQL Server the primary key indexes are clustered by default. PostgreSQL needs an extra JOIN on tid (tuple identifier, a row pointer used by the index) to rerieve the table values. In this case, the index lookup takes much longer and sorting becomes a faster operation even on a large dataset that comprises as many as 1,000,000 records (for the French and English names).

It's not clear, though, why PostgreSQL chose bitmap access in the latter case, since it used an Index Scan anyway, and Bitmap Heap Scan destroys the index sort order which later needs to be restored with a Sort. However, on 2,000 records is does not matter much.

One subquery and one join


Click on the link below to expand the queries.

French

Latin

As in Oracle and SQL Server, this query is much more efficient for the French names but vastly inefficient for the Latin ones.

This is, of course, also depends of the frequency of the local names being found for an item. The fallback subquery is not evaluated if the local name for an item is found, which makes the subquery evaluate often for the poorly localized Latin names, while for the French names, the fallback subquery is run only in rare cases.

Summary


Like in Oracle and SQL Server, in PostgreSQL there are two different efficient methods to select fallback values:

  • If the probability of fallback is high, i. e. there are few main values, it's more efficient to select both the main value and the fallback value in a LEFT JOIN, like this:
    SELECT  SUM(LENGTH(cname))
    FROM    (
            SELECT  COALESCE (ln.name, len.name) AS cname
            FROM    t_item i
            LEFT JOIN
                    t_language ln
            ON      ln.item = i.id
                    AND ln.language = 'la'
            LEFT JOIN
                    t_language len
            ON      len.item = i.id
                    AND len.language = 'en'
            ) q
    
  • If the probability of fallback is low, i. e. there are many main values, it's more efficient to select the main value in a LEFT JOIN, and the fallback value in a subquery, which is a second argument to COALESCE, like this:
    SELECT  SUM(LENGTH(cname))
    FROM    (
            SELECT  COALESCE (
                    ln.name,
                    (
                    SELECT  name
                    FROM    t_language len
                    WHERE   len.item = i.id
                            AND len.language = 'en'
    
                    )
                    ) AS cname
            FROM    t_item i
            LEFT JOIN
                    t_language ln
            ON      ln.item = i.id
                    AND ln.language = 'fr'
            ) q
    

To be continued.

Written by Quassnoi

August 8th, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply