EXPLAIN EXTENDED

How to create fast database queries

Fallback language names: SQL Server

Comments enabled. I *really* need your comment

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

From Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

  1. Use COALESCE on two SELECT list subqueries
  2. Use COALESCE on the results of two LEFT JOINS
  3. Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Yesterday, we saw that the second option (two LEFT JOINs) is best in Oracle if the probability of the fallback is high (i. e. the localization is poor and but few terms are translated into the local language). I took Latin language as an example for this.

Third query (a LEFT JOIN with the local language rowset and a subquery in a COALESCE to retrive the fallback language names) is best if there is a low probability of a fallback, i. e. almost all terms are translated. In this case, I took French as an example, since it is widely used and localizations are likely to cover most terms.

Today we will see how same queries work in SQL Server.

Let's create sample tables:

Table creation details

t_item contains 1,000,000 items.

t_language contains 1,000,000 English names, 999,000 French names and 2,000 Latin names.

French is an example of low fallback probability, Latin is the one of high fallback probability.

Two subqueries


Click on the link below to expand the queries

French

Latin

We see that the Latin and French queries yield completely different plans, and the Latin query is much more efficient (2.6 seconds for Latin, 9.0 seconds for French)

This is because SQL Server was able to use more efficient Hash Match for producing the resultset with the Latin subquery.

However, the English subquery is alwasy build on demand, that's why the Nested Loops are the only choise.

Two LEFT JOINs


Click on the link below to expand the queries

French

Latin

Both these queries produce exactly same plan, namely using two Merge Joins for both tables.

The Latin one is more efficient since there are fewer values to join here.

One subquery and one join


Click on the link below to expand the queries

French

Latin

The different languages generate different execution plans, but in this case the French is much more efficient than Latin.

The reason is the same as in Oracle: the fallback subquery is executed much more often for the missing Latin names (which are numerious) than for missing French names (that are rare).

Summary


Like in Oracle, in SQL Server there are two different efficient methods to select fallback values:

  • If the probability of fallback is high, i. e. there are few main values, it's more efficient to select both the main value and the fallback value in a LEFT JOIN, like this:
    SELECT  SUM(LEN(cname))
    FROM    (
            SELECT  COALESCE (ln.name, len.name) AS cname
            FROM    [20090807_fallback].t_item i
            LEFT JOIN
                    [20090807_fallback].t_language ln
            ON      ln.item = i.id
                    AND ln.language = 'la'
            LEFT JOIN
                    [20090807_fallback].t_language len
            ON      len.item = i.id
                    AND len.language = 'en'
            ) q
    
  • If the probability of fallback is low, i. e. there are many main values, it's more efficient to select the main value in a LEFT JOIN, and the fallback value in a subquery, which is a second argument to COALESCE, like this:

    SELECT  SUM(LEN(cname))
    FROM    (
            SELECT  COALESCE (
                    ln.name,
                    (
                    SELECT  name
                    FROM    [20090807_fallback].t_language len
                    WHERE   len.item = i.id
                            AND len.language = 'en'
    
                    )
                    ) AS cname
            FROM    [20090807_fallback].t_item i
            LEFT JOIN
                    [20090807_fallback].t_language ln
            ON      ln.item = i.id
                    AND ln.language = 'fr'
            ) q
    

    SQL Server always rewrites the join with the main rowset in most efficient way. However, the second argument to COALESCE is evaluated only if the first one is NULL. The subquery will be evaluated only in the rare cases when the main value will fail to retrieve.

To be continued.

Written by Quassnoi

August 7th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply