What is the problem with this query?
SELECT * FROM orders WHERE quantity <= 4 AND urgency <= 4
The problem is indexing strategy, of course. Which columns should we index?
If we index
quantity, the optimizer will be able to use the index to filter on it. However, filtering on
urgency will require scanning all records with
quantity < 4 and applying the
urgency filter to each record found.
urgency. We can use range access on
urgency using an index, but this will require filtering on
Why, create a composite index!, some will say.
Unfortunately, that won’t help much.
A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on
(quantity, urgency) will sort on
quantity, and only if the quantities are equal, it will take the
urgency into account.
The picture below shows how would the records be ordered in such an index:
As we can see, with a single index range scan (i. e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.
Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:
, and still no single range that contains only the records we need.
Can we improve it?
If we take at closer look at the table contents we will see that despite the fact that there is no single range that contains the records we need (and only them), there are four ranges that do filter our records.
If we rewrote the query condition like that:
SELECT * FROM orders WHERE quantity IN (0, 1, 2, 3, 4) AND urgency <= 4
with the first index, or like that:
SELECT * FROM orders WHERE quantity <= 4 AND urgency IN (0, 1, 2, 3, 4)
with the second index, then any decent SQL engine would build a decent and efficient plan (
Index range in MySQL,
INLIST ITERATOR in Oracle etc.)
The problem is that rewriting this query still requires an SQL developer. But this could be done automatically by the optimizer. There are several methods to do that.
A range condition defines a set of the values that could possibly satisfy the condition (i. e. belong to the range). The
number of the values that satisfy this condition is called set cardinality. The word number is quoted here because it can be infinite, and not all infinities are created equal: some are more infinite than the others!
However, if we take the column definition into account, we can see that some ranges define finite (and quite constrained) sets of possible values. For instance, a condition like
quantity < 4 on a column that is defined as
UNSIGNED INT can possibly be satisfied by five values: 0, 1, 2, 3 and 4.
This set is the smallest superset of all sets of values that satisfy the range condition.
Loose index scan
Even with the conditions that theoretically define the infinite sets of values that could satisfy them, practically there is always a finite number of values in the table that do satisfy them (the table itself contains the finite number of records, to begin with).
And most engines keep track of that number in their statistics tables: this is what is called field cardinality, a measure of field uniqueness.
If the range cardinality is expected to be low (either from the set of values that can possibly belong to the range, or from the actual number of distinct values that do belong to the range, according to statistics), it would be a wise idea to rewrite the range condition as an
IN condition containing all possible values that can belong or do belong to the range.
This will replace a single
less than or
greater then with a small number of
equals to. And an
equals to gives the optimizer much more space to, um, optimize. It can use a
HASH JOIN, split the index into a number of continuous ranges or do some other interesting things that can only be done with an equijoin.
UNSIGNED INTEGER column, it is easy to generate a set of values that could satisfy the range. But what if we know the range cardinality to be low from the statistics, not from the column datatype?
In this case, we could build the set of possible values using what MySQL calls a loose index scan.
Basically, it takes the first record from the index and then recursively searches the next lowest record whose key value is greater than the previous one, using the
index seek (as opposed to
index scan). This means instead of mere scanning the index and applying the condition to each field, the engine would walk up and down the B-Tree to locate the first record the the greater key. It is much more efficient when the number of distinct keys is small. And in fact, MySQL does use this method for queries that involve
SELECT DISTINCT on an indexed field.
MIN / MAX
This method would be useful for the open ranges (like
<, as opposed to
By default, the
INT column means
SIGNED INT. For a condition like
quantity <= 4 this would mean all integers from -2,147,483,647 to 4 which is way too many.
In real tables, the quantities would be something greater than 0. But not all developers bother to pick a right datatype or add
CHECK constraints for their columns (and some databases like PostgreSQL lack the unsigned datatypes anyway).
To work around this, we could find the minimum existing value in the index using a single index seek. It would serve as a lower bound for the range. Since in real table that would most probably be something like 0 or 1, that would make the range much more constrained.
All these three methods could be used at the same time. Since the methods require nothing but just a single lookup of the statistics table and a single index seek, the most efficient method to return the subset of values that could satisfy the range could be chosen at runtime.
Now, let’s make a sample table in PostgreSQL and see how could we benefit from replacing the low-cardinality ranges with the lists of values:
CREATE TABLE t_composite ( id INT NOT NULL, uint1 INT NOT NULL, uint2 INT NOT NULL, real1 DOUBLE PRECISION NOT NULL, real2 DOUBLE PRECISION NOT NULL, stuffing VARCHAR(200) NOT NULL ); SELECT SETSEED(0.20100518); INSERT INTO t_composite SELECT n, CEILING(RANDOM() * 40), CEILING(RANDOM() * 400000), CEILING(RANDOM() * 40) * 0.01, CEILING(RANDOM() * 400000) * 0.01, RPAD('', 200, '*') FROM generate_series(1, 16000000) n; ALTER TABLE t_composite ADD CONSTRAINT pk_composite_id PRIMARY KEY (id); CREATE INDEX ix_composite_uint ON t_composite (uint1, uint2); CREATE INDEX ix_composite_real ON t_composite (real1, real2);
This table contains 16,000,000 records with two integer fields and two
double precision fields.
There are composite indexes on the pairs of fields. This indexes are intentionally created with the least selective column leading to demonstrate the benefits of range transformation.
First, let’s run a query similar to the original one:
SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND uint1 <= 20 AND uint2 <= 20
|1 row fetched in 0.0001s (5.1249s)|
Aggregate (cost=171331.57..171331.58 rows=1 width=204) -> Index Scan using ix_composite_uint on t_composite (cost=0.00..171329.57 rows=796 width=204) Index Cond: ((uint1 <= 20) AND (uint2 <= 20))
The plan says that the index condition involves both fields. However, only the first field is used in the B-Tree search: the second is just being filtered on, though no actual table access is performed yet on this step.
The index keys are not too long, however, there are several millions of them that need to be scanned. That’s why the query takes more than 5 seconds to complete.
The I/O statistics show the following:
SELECT pg_stat_reset(); SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND uint1 <= 20 AND uint2 <= 20; SELECT pg_stat_get_blocks_fetched('ix_composite_uint'::regclass);
The query required more than twenty thousands of index blocks to be read and examined
Now, let’s try to substitute the hard-coded list of possible values instead of the range condition:
SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND uint1 IN ( SELECT generate_series(0, 20) ) AND uint2 <= 20
|1 row fetched in 0.0001s (0.0063s)|
Aggregate (cost=167.47..167.49 rows=1 width=204) -> Nested Loop (cost=0.02..167.37 rows=40 width=204) -> HashAggregate (cost=0.02..0.03 rows=1 width=4) -> Result (cost=0.00..0.01 rows=1 width=0) -> Index Scan using ix_composite_uint on t_composite (cost=0.00..166.84 rows=40 width=208) Index Cond: ((t_composite.uint1 = (generate_series(0, 20))) AND (t_composite.uint2 <= 20))
Instead of a giant singe range, there are 21 short ranges examined in a nested loop. This is instant (6 ms).
Let’s look into the I/O statistics again:
SELECT pg_stat_reset(); SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND uint1 IN ( SELECT generate_series(0, 20) ) AND uint2 <= 20; SELECT pg_stat_get_blocks_fetched('ix_composite_uint'::regclass);
Now, only 64 blocks need to be read.
MIN / MAX
We took 0 as the initial value, but since theoretically there can be negative numbers in the columns, this assumption is not safe.
We need to get the least value from the table instead of assuming it:
SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND uint1 IN ( SELECT generate_series( ( SELECT MIN(uint1) FROM t_composite ), 20) ) AND uint2 <= 20
|1 row fetched in 0.0001s (0.0066s)|
Aggregate (cost=171.40..171.41 rows=1 width=204) -> Nested Loop (cost=3.95..171.29 rows=40 width=204) -> HashAggregate (cost=3.95..3.96 rows=1 width=4) -> Result (cost=3.92..3.93 rows=1 width=0) InitPlan 2 (returns $1) -> Result (cost=3.91..3.92 rows=1 width=0) InitPlan 1 (returns $0) -> Limit (cost=0.00..3.91 rows=1 width=4) -> Index Scan using ix_composite_uint on t_composite (cost=0.00..62571556.68 rows=15999664 width=4) Filter: (uint1 IS NOT NULL) -> Index Scan using ix_composite_uint on t_composite (cost=0.00..166.84 rows=40 width=208) Index Cond: (("20100519_cardinality".t_composite.uint1 = (generate_series($1, 20))) AND ("20100519_cardinality".t_composite.uint2 <= 20))
This is instant again.
Now, what about statistics?
SELECT pg_stat_reset(); SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND uint1 IN ( SELECT generate_series( ( SELECT MIN(uint1) FROM t_composite ), 20) ) AND uint2 <= 20; SELECT pg_stat_get_blocks_fetched('ix_composite_uint'::regclass);
The query is only 3 block reads heavier, but this time it is guaranteed to be correct.
Loose index scan
real* columns hold the double precision data.
First, let’s run the original query:
SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND real1 <= 0.2 AND real2 <= 0.2
|1 row fetched in 0.0001s (6.6561s)|
Aggregate (cost=206210.82..206210.84 rows=1 width=204) -> Index Scan using ix_composite_real on t_composite (cost=0.00..206208.84 rows=793 width=204) Index Cond: ((real1 <= 0.2::double precision) AND (real2 <= 0.2::double precision))
Again, there are way too many block reads:
SELECT pg_stat_reset(); SELECT SUM(LENGTH(stuffing)) FROM t_composite WHERE 1 = 1 AND real1 <= 0.2 AND real2 <= 0.2; SELECT pg_stat_get_blocks_fetched('ix_composite_real'::regclass);
, and the query takes more than 6 seconds.
For a condition like
real1 <= 0.2, the smallest superset of all possible values (that is all possible double-precision values between 0 and 0.2) would be too large (though still finite of course) to be generated and joined. That’s why we need to use server-collected statistics to decide whether a loose index scan would be efficient to get the list of all distinct values of
real1 in the table:
SELECT n_distinct, most_common_vals, histogram_bounds FROM pg_stats WHERE schemaname = '20100519_cardinality' AND tablename = 't_composite' AND attname IN ('real1', 'real2')
|2 rows fetched in 0.0003s (0.0072s)|
From the server statistics we see that there are only 40 distinct values of
real1. Hence, a loose index scan as such would be efficient.
Let’s look into the stats on
real2. We see that there are
362181 distinct values in the table, and the range in question (
<= 0.2) corresponds to the first entry in the histogram (
BETWEEN 0.07 AND 44.3). Since the histogram splits the values into 100 percentiles, this means that there are about 3622 values from 0.07 to 44.3, and
((0.2 - 0.07) / (44.3 - 0.07) * 362181 / 100) ≈ 11 distinct values inside our range, and about
11 * 16000000 /362181 ≈ 486 records with these values.
Taking into account that the original query would need to scan 8,000,000 records, the loose index scan seems to be a good idea.
Unfortunately, PostgreSQL does not support it directly, but with minimal effort it can be emulated:
WITH RECURSIVE q (real1) AS ( SELECT MIN(real1) FROM t_composite UNION ALL SELECT ( SELECT c.real1 FROM t_composite c WHERE c.real1 > q.real1 AND c.real1 <= 0.2 ORDER BY c.real1 LIMIT 1 ) FROM q WHERE q.real1 IS NOT NULL ) SELECT SUM(LENGTH(stuffing)) FROM q JOIN t_composite c ON c.real1 = q.real1 AND c.real2 <= 0.2
|1 row fetched in 0.0001s (0.0082s)|
Aggregate (cost=17319.78..17319.80 rows=1 width=204) CTE q -> Recursive Union (cost=3.92..401.33 rows=101 width=8) -> Result (cost=3.92..3.93 rows=1 width=0) InitPlan 1 (returns $1) -> Limit (cost=0.00..3.92 rows=1 width=8) -> Index Scan using ix_composite_real on t_composite (cost=0.00..62694798.76 rows=15999664 width=8) Filter: (real1 IS NOT NULL) -> WorkTable Scan on q (cost=0.00..39.54 rows=10 width=8) Filter: (q.real1 IS NOT NULL) SubPlan 2 -> Limit (cost=0.00..3.93 rows=1 width=8) -> Index Scan using ix_composite_real on t_composite c (cost=0.00..314695.34 rows=79998 width=8) Index Cond: ((real1 > $2) AND (real1 <= 0.2::double precision)) -> Nested Loop (cost=0.00..16914.48 rows=1588 width=204) -> CTE Scan on q (cost=0.00..2.02 rows=101 width=8) -> Index Scan using ix_composite_real on t_composite c (cost=0.00..166.95 rows=40 width=212) Index Cond: ((c.real1 = q.real1) AND (c.real2 <= 0.2::double precision))
This is almost instant again. Here are the I/O statistics:
SELECT pg_stat_reset(); WITH RECURSIVE q (real1) AS ( SELECT MIN(real1) FROM t_composite UNION ALL SELECT ( SELECT c.real1 FROM t_composite c WHERE c.real1 > q.real1 AND c.real1 <= 0.2 ORDER BY c.real1 LIMIT 1 ) FROM q WHERE q.real1 IS NOT NULL ) SELECT SUM(LENGTH(stuffing)) FROM q JOIN t_composite c ON c.real1 = q.real1 AND c.real2 <= 0.2; SELECT pg_stat_get_blocks_fetched('ix_composite_real'::regclass);
The number of index reads is reduced greatly again.
In some cases, a range predicate (like
greater than or
between) can be rewritten as an
IN predicate against the list of values that could satisfy the range condition.
Depending on the column datatype, check constraints and statistics, that list could be comprised of all possible values defined by the column’s domain; all possible values defined by column’s minimal and maximal value, or all actual distinct values contained in the table. In the latter case, a loose index scan could be used to retrieve the list of such values.
Since an equality condition is applied to each value in the list, more access and join methods could be used to build the query plain, including range conditions on secondary index columns, hash lookups etc.
Whenever the optimizer builds a plan for a query that contains a range predicate, it should consider rewriting the range condition as an
IN predicate and use the latter method if it proves more efficient.