EXPLAIN EXTENDED

How to create fast database queries

Things SQL needs: determining range cardinality

with 14 comments

What is the problem with this query?

SELECT  *
FROM    orders
WHERE   quantity <= 4
        AND urgency <= 4


The problem is indexing strategy, of course. Which columns should we index?

If we index quantity, the optimizer will be able to use the index to filter on it. However, filtering on urgency will require scanning all records with quantity < 4 and applying the urgency filter to each record found.

Same with urgency. We can use range access on urgency using an index, but this will require filtering on quantity.

Why, create a composite index!, some will say.

Unfortunately, that won't help much.

A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on (quantity, urgency) will sort on quantity, and only if the quantities are equal, it will take the urgency into account.

The picture below shows how would the records be ordered in such an index:

As we can see, with a single index range scan (i. e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.

Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:

, and still no single range that contains only the records we need.

Can we improve it?

If we take at closer look at the table contents we will see that despite the fact that there is no single range that contains the records we need (and only them), there are four ranges that do filter our records.

If we rewrote the query condition like that:

SELECT  *
FROM    orders
WHERE   quantity IN (0, 1, 2, 3, 4)
        AND urgency <= 4

with the first index, or like that:

SELECT  *
FROM    orders
WHERE   quantity <= 4
        AND urgency IN (0, 1, 2, 3, 4)

with the second index, then any decent SQL engine would build a decent and efficient plan (Index range in MySQL, INLIST ITERATOR in Oracle etc.)

The problem is that rewriting this query still requires an SQL developer. But this could be done automatically by the optimizer. There are several methods to do that.

Smallest superset

A range condition defines a set of the values that could possibly satisfy the condition (i. e. belong to the range). The number of the values that satisfy this condition is called set cardinality. The word number is quoted here because it can be infinite, and not all infinities are created equal: some are more infinite than the others!

However, if we take the column definition into account, we can see that some ranges define finite (and quite constrained) sets of possible values. For instance, a condition like quantity < 4 on a column that is defined as UNSIGNED INT can possibly be satisfied by five values: 0, 1, 2, 3 and 4.

This set is the smallest superset of all sets of values that satisfy the range condition.

Loose index scan

Even with the conditions that theoretically define the infinite sets of values that could satisfy them, practically there is always a finite number of values in the table that do satisfy them (the table itself contains the finite number of records, to begin with).

And most engines keep track of that number in their statistics tables: this is what is called field cardinality, a measure of field uniqueness.

If the range cardinality is expected to be low (either from the set of values that can possibly belong to the range, or from the actual number of distinct values that do belong to the range, according to statistics), it would be a wise idea to rewrite the range condition as an IN condition containing all possible values that can belong or do belong to the range.

This will replace a single less than or greater then with a small number of equals to. And an equals to gives the optimizer much more space to, um, optimize. It can use a HASH JOIN, split the index into a number of continuous ranges or do some other interesting things that can only be done with an equijoin.

With an UNSIGNED INTEGER column, it is easy to generate a set of values that could satisfy the range. But what if we know the range cardinality to be low from the statistics, not from the column datatype?

In this case, we could build the set of possible values using what MySQL calls a loose index scan.

Basically, it takes the first record from the index and then recursively searches the next lowest record whose key value is greater than the previous one, using the index seek (as opposed to index scan). This means instead of mere scanning the index and applying the condition to each field, the engine would walk up and down the B-Tree to locate the first record the the greater key. It is much more efficient when the number of distinct keys is small. And in fact, MySQL does use this method for queries that involve SELECT DISTINCT on an indexed field.

MIN / MAX

This method would be useful for the open ranges (like > or <, as opposed to BETWEEN).

By default, the INT column means SIGNED INT. For a condition like quantity <= 4 this would mean all integers from -2,147,483,647 to 4 which is way too many.

In real tables, the quantities would be something greater than 0. But not all developers bother to pick a right datatype or add CHECK constraints for their columns (and some databases like PostgreSQL lack the unsigned datatypes anyway).

To work around this, we could find the minimum existing value in the index using a single index seek. It would serve as a lower bound for the range. Since in real table that would most probably be something like 0 or 1, that would make the range much more constrained.

All these three methods could be used at the same time. Since the methods require nothing but just a single lookup of the statistics table and a single index seek, the most efficient method to return the subset of values that could satisfy the range could be chosen at runtime.

Implementation

Now, let's make a sample table in PostgreSQL and see how could we benefit from replacing the low-cardinality ranges with the lists of values:

Table creation details

This table contains 16,000,000 records with two integer fields and two double precision fields.

There are composite indexes on the pairs of fields. This indexes are intentionally created with the least selective column leading to demonstrate the benefits of range transformation.

First, let's run a query similar to the original one:

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND uint1 <= 20
        AND uint2 <= 20
sum
75600
1 row fetched in 0.0001s (5.1249s)
Aggregate  (cost=171331.57..171331.58 rows=1 width=204)
  ->  Index Scan using ix_composite_uint on t_composite  (cost=0.00..171329.57 rows=796 width=204)
        Index Cond: ((uint1 <= 20) AND (uint2 <= 20))

The plan says that the index condition involves both fields. However, only the first field is used in the B-Tree search: the second is just being filtered on, though no actual table access is performed yet on this step.

The index keys are not too long, however, there are several millions of them that need to be scanned. That's why the query takes more than 5 seconds to complete.

The I/O statistics show the following:

SELECT  pg_stat_reset();

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND uint1 <= 20
        AND uint2 <= 20;

SELECT  pg_stat_get_blocks_fetched('ix_composite_uint'::regclass);
pg_stat_get_blocks_fetched
21865

The query required more than twenty thousands of index blocks to be read and examined

Smallest superset

Now, let's try to substitute the hard-coded list of possible values instead of the range condition:

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND uint1 IN
        (
        SELECT  generate_series(0, 20)
        )
        AND uint2 <= 20

sum
75600
1 row fetched in 0.0001s (0.0063s)
Aggregate  (cost=167.47..167.49 rows=1 width=204)
  ->  Nested Loop  (cost=0.02..167.37 rows=40 width=204)
        ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4)
              ->  Result  (cost=0.00..0.01 rows=1 width=0)
        ->  Index Scan using ix_composite_uint on t_composite  (cost=0.00..166.84 rows=40 width=208)
              Index Cond: ((t_composite.uint1 = (generate_series(0, 20))) AND (t_composite.uint2 <= 20))

Instead of a giant singe range, there are 21 short ranges examined in a nested loop. This is instant (6 ms).

Let's look into the I/O statistics again:

SELECT  pg_stat_reset();

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND uint1 IN
        (
        SELECT  generate_series(0, 20)
        )
        AND uint2 <= 20;

SELECT  pg_stat_get_blocks_fetched('ix_composite_uint'::regclass);
pg_stat_get_blocks_fetched
64

Now, only 64 blocks need to be read.

MIN / MAX

We took 0 as the initial value, but since theoretically there can be negative numbers in the columns, this assumption is not safe.

We need to get the least value from the table instead of assuming it:

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND uint1 IN
        (
        SELECT  generate_series(
                (
                SELECT  MIN(uint1)
                FROM    t_composite
                ), 20)
        )
        AND uint2 <= 20
sum
75600
1 row fetched in 0.0001s (0.0066s)
Aggregate  (cost=171.40..171.41 rows=1 width=204)
  ->  Nested Loop  (cost=3.95..171.29 rows=40 width=204)
        ->  HashAggregate  (cost=3.95..3.96 rows=1 width=4)
              ->  Result  (cost=3.92..3.93 rows=1 width=0)
                    InitPlan 2 (returns $1)
                      ->  Result  (cost=3.91..3.92 rows=1 width=0)
                            InitPlan 1 (returns $0)
                              ->  Limit  (cost=0.00..3.91 rows=1 width=4)
                                    ->  Index Scan using ix_composite_uint on t_composite  (cost=0.00..62571556.68 rows=15999664 width=4)
                                          Filter: (uint1 IS NOT NULL)
        ->  Index Scan using ix_composite_uint on t_composite  (cost=0.00..166.84 rows=40 width=208)
              Index Cond: (("20100519_cardinality".t_composite.uint1 = (generate_series($1, 20))) AND ("20100519_cardinality".t_composite.uint2 <= 20))

This is instant again.

Now, what about statistics?

SELECT  pg_stat_reset();

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND uint1 IN
        (
        SELECT  generate_series(
                (
                SELECT  MIN(uint1)
                FROM    t_composite
                ), 20)
        )
        AND uint2 <= 20;

SELECT  pg_stat_get_blocks_fetched('ix_composite_uint'::regclass);
pg_stat_get_blocks_fetched
67

The query is only 3 block reads heavier, but this time it is guaranteed to be correct.

Loose index scan

The real* columns hold the double precision data.

First, let's run the original query:

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND real1 <= 0.2
        AND real2 <= 0.2

sum
83600
1 row fetched in 0.0001s (6.6561s)
Aggregate  (cost=206210.82..206210.84 rows=1 width=204)
  ->  Index Scan using ix_composite_real on t_composite  (cost=0.00..206208.84 rows=793 width=204)
        Index Cond: ((real1 <= 0.2::double precision) AND (real2 <= 0.2::double precision))

Again, there are way too many block reads:

SELECT  pg_stat_reset();

SELECT  SUM(LENGTH(stuffing))
FROM    t_composite
WHERE   1 = 1
        AND real1 <= 0.2
        AND real2 <= 0.2;

SELECT  pg_stat_get_blocks_fetched('ix_composite_real'::regclass);
pg_stat_get_blocks_fetched
30657

, and the query takes more than 6 seconds.

For a condition like real1 <= 0.2, the smallest superset of all possible values (that is all possible double-precision values between 0 and 0.2) would be too large (though still finite of course) to be generated and joined. That's why we need to use server-collected statistics to decide whether a loose index scan would be efficient to get the list of all distinct values of real1 in the table:

SELECT  n_distinct, most_common_vals, histogram_bounds
FROM    pg_stats
WHERE   schemaname = '20100519_cardinality'
        AND tablename = 't_composite'
        AND attname IN ('real1', 'real2')

n_distinct most_common_vals histogram_bounds
40 {0.09,0.22,0.01,0.37,0.04,0.25,0.28,0.34,0.27,0.06,0.36,0.11,0.08,0.39,0.12,0.2,0.02,0.16,0.17,0.21,0.29,0.18,0.19,0.26,0.3,0.32,0.35,0.1,0.13,0.4,0.15,0.23,0.38,0.03,0.33,0.24,0.07,0.14,0.05,0.31}
362181 {1781.85,128.6,142.73,257.88,332.62,618.61,705.35,829.91,845.82,874.08,1432.82,1469.16,1486.01,1569.16,1866.43,2111.48,2234.48,2282.78,2340.7,2382.54,2468.19,2491.35,2494.73,2508.51,2587.51,2750.98,2876.07,2956.11,3222.62,3463.2,3564.41,3872.25,3.43,13.22,14.3,15.82,16.12,16.36,19.98,21.79,23.03,25.41,28.24,28.96,31.95,35.1,36.31,38.46,39.72,39.87,42.42,49.32,50.84,56.59,60.31,81.8,84.48,84.74,86.62,88.58,91.82,103.48,112.88,114.99,117.55,119.55,122.52,123.04,127.98,128.26,130.56,131.15,134.57,134.91,135.8,139.05,139.41,141.39,146.13,151.92,158.16,158.58,161.71,169.92,170.43,173.3,173.74,178.06,189.43,192.99,199.69,210.98,218.45,225.44,230.16,233.25,233.67,240.9,246.5,249.77} {0.07,44.3,86.34,129.18,172.01,212.46,254.83,294.83,336.18,374.44,414.75,458.07,498.72,540.88,582.92,621.95,660.28,700.75,738.46,778.95,822.09,857.34,896.27,937.63,977.05,1017.49,1062.93,1104.88,1147.17,1185.99,1225.41,1267.8,1307.45,1347.49,1386.1,1424.05,1465.05,1503.23,1542.03,1581.45,1619.53,1658.57,1697.74,1741.65,1782.59,1823.19,1867.54,1905.59,1945.39,1986.04,2022.32,2061.52,2102.77,2143.07,2180.29,2218.06,2262.62,2302.28,2342.05,2382.29,2416.18,2455.43,2489.25,2528.07,2572.15,2606.91,2648.59,2685.43,2724.9,2765.49,2805.78,2845.94,2886.33,2925.52,2967.07,3005.12,3046.34,3084.44,3122.81,3158.25,3199.74,3242.4,3279.34,3319.41,3359.39,3401.15,3436.32,3477.28,3517.05,3559.6,3599.6,3639.63,3678.82,3720.27,3760.32,3801.89,3844.98,3883.96,3922.63,3963.01,3999.98}
2 rows fetched in 0.0003s (0.0072s)

From the server statistics we see that there are only 40 distinct values of real1. Hence, a loose index scan as such would be efficient.

Let's look into the stats on real2. We see that there are 362181 distinct values in the table, and the range in question (<= 0.2) corresponds to the first entry in the histogram (BETWEEN 0.07 AND 44.3). Since the histogram splits the values into 100 percentiles, this means that there are about 3622 values from 0.07 to 44.3, and ((0.2 - 0.07) / (44.3 - 0.07) * 362181 / 100) ≈ 11 distinct values inside our range, and about 11 * 16000000 /362181 ≈ 486 records with these values.

Taking into account that the original query would need to scan 8,000,000 records, the loose index scan seems to be a good idea.

Unfortunately, PostgreSQL does not support it directly, but with minimal effort it can be emulated:

WITH    RECURSIVE q (real1) AS
        (
        SELECT  MIN(real1)
        FROM    t_composite
        UNION ALL
        SELECT  (
                SELECT  c.real1
                FROM    t_composite c
                WHERE   c.real1 > q.real1
                        AND c.real1 <= 0.2
                ORDER BY
                        c.real1
                LIMIT 1
                )
        FROM    q
        WHERE   q.real1 IS NOT NULL
        )
SELECT  SUM(LENGTH(stuffing))
FROM    q
JOIN    t_composite c
ON      c.real1 = q.real1
        AND c.real2 <= 0.2

sum
83600
1 row fetched in 0.0001s (0.0082s)
Aggregate  (cost=17319.78..17319.80 rows=1 width=204)
  CTE q
    ->  Recursive Union  (cost=3.92..401.33 rows=101 width=8)
          ->  Result  (cost=3.92..3.93 rows=1 width=0)
                InitPlan 1 (returns $1)
                  ->  Limit  (cost=0.00..3.92 rows=1 width=8)
                        ->  Index Scan using ix_composite_real on t_composite  (cost=0.00..62694798.76 rows=15999664 width=8)
                              Filter: (real1 IS NOT NULL)
          ->  WorkTable Scan on q  (cost=0.00..39.54 rows=10 width=8)
                Filter: (q.real1 IS NOT NULL)
                SubPlan 2
                  ->  Limit  (cost=0.00..3.93 rows=1 width=8)
                        ->  Index Scan using ix_composite_real on t_composite c  (cost=0.00..314695.34 rows=79998 width=8)
                              Index Cond: ((real1 > $2) AND (real1 <= 0.2::double precision))
  ->  Nested Loop  (cost=0.00..16914.48 rows=1588 width=204)
        ->  CTE Scan on q  (cost=0.00..2.02 rows=101 width=8)
        ->  Index Scan using ix_composite_real on t_composite c  (cost=0.00..166.95 rows=40 width=212)
              Index Cond: ((c.real1 = q.real1) AND (c.real2 <= 0.2::double precision))

This is almost instant again. Here are the I/O statistics:

SELECT  pg_stat_reset();

WITH    RECURSIVE q (real1) AS
        (
        SELECT  MIN(real1)
        FROM    t_composite
        UNION ALL
        SELECT  (
                SELECT  c.real1
                FROM    t_composite c
                WHERE   c.real1 > q.real1
                        AND c.real1 <= 0.2
                ORDER BY
                        c.real1
                LIMIT 1
                )
        FROM    q
        WHERE   q.real1 IS NOT NULL
        )
SELECT  SUM(LENGTH(stuffing))
FROM    q
JOIN    t_composite c
ON      c.real1 = q.real1
        AND c.real2 <= 0.2;

SELECT  pg_stat_get_blocks_fetched('ix_composite_real'::regclass);
pg_stat_get_blocks_fetched
165

The number of index reads is reduced greatly again.

Summary

In some cases, a range predicate (like less than, greater than or between) can be rewritten as an IN predicate against the list of values that could satisfy the range condition.

Depending on the column datatype, check constraints and statistics, that list could be comprised of all possible values defined by the column's domain; all possible values defined by column's minimal and maximal value, or all actual distinct values contained in the table. In the latter case, a loose index scan could be used to retrieve the list of such values.

Since an equality condition is applied to each value in the list, more access and join methods could be used to build the query plain, including range conditions on secondary index columns, hash lookups etc.

Whenever the optimizer builds a plan for a query that contains a range predicate, it should consider rewriting the range condition as an IN predicate and use the latter method if it proves more efficient.

Written by Quassnoi

May 19th, 2010 at 11:00 pm

Posted in Miscellaneous

14 Responses to 'Things SQL needs: determining range cardinality'

Subscribe to comments with RSS or TrackBack to 'Things SQL needs: determining range cardinality'.

  1. great article :)
    Thanks

    marc-o

    23 May 10 at 23:00

  2. Great article, and I think it provides an answer to a stackoverflow question. I posted an answer with a link to this article and I included your summary in the answer. I hope this is ok. The link to the answer is:

    http://stackoverflow.com/questions/2914082/how-do-you-design-a-database-to-allow-fast-multicolumn-searching/2915541#2915541

    Trey Jackson

    26 May 10 at 22:08

  3. @Trey: the mechanism used for a multi-column search with ranged conditions would be similar indeed. The article actually discusses the mechanism’s integration into the query optimizers: an experienced developer can rewrite the query using the same approach I used in the article. The query still needs to be backed by a composite index (or a number of indexes) on the appropriate columns, though.

    Quassnoi

    27 May 10 at 02:05

  4. Nice article, as always. Big fan of your site, thanks!

    Frode

    18 Jun 10 at 15:48

  5. @Frode: thanks!

    Quassnoi

    22 Jun 10 at 11:15

  6. Thanks for letting us understand how indexing works, I did understood well now.

    Thanks for the good work.

    Just curious to know how does these number works with “joins” (many-tables).

    Rakesh

    2 Jul 10 at 11:35

  7. @author/admin:-
    If you guys have a plugin which will give you the list of related articles in end of each blog sections will be great.

    Rakesh

    2 Jul 10 at 11:37

  8. @Quassnoi thank you for a deeply thought-provoking article. I personally find it shocking that a RDBMS cannot efficiently process ranges, as you have proven. I thought that < 20 would have done exactly what IN () is doing. To me, it seems obvious that is should. Does SQL Server suffer this fate too?

    On a related point, if those columns were individually indexed, as opposed to indexed as you have done with low cardinality, would we still be looking at the scenario you outline here?

    Ian

    14 Nov 10 at 01:14

  9. @Ian:

    I thought that < 20 would have done exactly what IN () is doing

    IN() on what column type?

    If the column type is, say, FLOAT, the smallest superset of the values that satisfy the condition is close to infinity (I know this is wrong in math sense, but you know what I mean). It’s infeasible to convert the range into the set.

    If the type is UNSIGNED INT then of course it can and sometimes needs to be done.

    No optimizer known to me is able to do it, if that was your question, that’s why I wrote this article.

    On a related point, if those columns were individually indexed, as opposed to indexed as you have done with low cardinality, would we still be looking at the scenario you outline here?

    The optimizers are capable of doing INDEX_INTERSECT, that is selecting the recordsets satisfying each condition separately then joining these recordsets. This would require scanning 20000 + 20000 = 40000 records plus overhead of joining them, as opposed to just 400 records with a range rewrite.

    Quassnoi

    14 Nov 10 at 16:28

  10. @Quassnoi thanks for your reply. I mean IN() on int types. Ok, something here seems illogical to me (not from you). Taking your example, let’s look at the quantity column. If it is indexed, then we know it’s sorted. So if we query “<= 4", then surely the query engine will walk only the top of the index until it finds 5, at which point it will quit, having found all the records that satisfy "<= 4". Then it would either do a table scan **only on the relevant records** to filter urgency, OR, better, it would do a walk-through on the urgency index, find all records for "<= 4", and then perform an index intersect on the two index result sets, giving us the RIDs we're looking for. To me, this is logical. I don't see why it needs to scan 40k records.

    Ian

    14 Nov 10 at 18:18

  11. So if we query “<= 4″, then surely the query engine will walk only the top of the index until it finds 5, at which point it will quit, having found all the records that satisfy “<= 4″

    That’s 5000 records (from 1 to 5000)

    Then it would either do a table scan **only on the relevant records** to filter urgency,

    Urgency is not a part of the index on quantity. We’ll have to do a table lookup for each of 5000 records, each of them being about five times as costly.

    Better, it would do a walk-through on the urgency index, find all records for “<= 4″, and then perform an index intersect on the two index result sets, giving us the RIDs we’re looking for

    That’s another 5000 records, plus the overhead of a join between the resultsets — opposed to 25 (twenty five) index seeks in a composite index with an IN rewrite.

    I don’t see why it needs to scan 40k records.

    If we substitute 5 with 20 (as in your original comment), we’ll get 40k vs 400. If we leave 5 as is, we’ll get 10k vs 25.

    Quassnoi

    14 Nov 10 at 19:13

  12. “That’s 5000 records (from 1 to 5000)”
    “That’s another 5000 records”

    I’m missing why it would need to do any record look-ups when all the data is contained in 2 indexes. Table lookups would only happen once the RIDs had been calculated via index lookup ops.

    Ian

    14 Nov 10 at 19:45

  13. I’m missing why it would need to do any record look-ups when all the data is contained in 2 indexes

    Each index, as you suggested them, contains data on the respective field only (one on quantity, another one on urgency). There is no information about urgency in the quantity index and vice versa. You should look that up in the table.

    Of course you could make both indexes covering, but this would still require you to scan 5k records (instead of but 25).

    Quassnoi

    14 Nov 10 at 20:34

  14. This article seems to be attempting a similar goal, but in a different way. http://wiki.postgresql.org/wiki/Loose_indexscan

    Ian

    2 Jan 11 at 11:48

Leave a Reply