EXPLAIN EXTENDED

How to create fast database queries


Creating indexes


Answering the questions asked on the site.

Dima asks:

I have a single rather large table with 18 columns.

Data in the table are continuous time-stamped records.

On one hand, the application is appending new records to this table; on the other hand, users can do various queries.

My question is: should I use composite indexes, a few single-column indexes, or a mix of the two?

There could be all sorts of combinations in the WHERE clause; I don't want to limit users in their choices.

Is there any common sense strategy to decide on indexes in my case?

Unfortunately, you forgot to mention which RDBMS you are using, so I'll assume MySQL; the principles, however, are the same for any database.

Generally speaking, an index on (column1, column2, column3) can be used for any kind of predicate that can be represented in the following form:

ROW(@value1_start, @value2_start, @value3_start) < ROW(column1, column2, column3) < ROW(@value1_end, @value2_end, @value3_end)

In other words, if all the records are sorted by column1, column2, column3, the condition should select a contiguous block of such a sorted set of records.

Any @value_start here may be equal to its @value_end counterpart (an equality condition), and any of them may be -Infinity or +Infinity (no bound on that column).
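MySQL supports row constructor comparisons directly, so this canonical form can be written almost literally. A minimal sketch (the table and index here are made up for illustration; note also that older versions of MySQL did not use indexes efficiently for row constructor comparisons, so in practice the condition is usually written in its expanded AND/OR form):

```sql
-- Hypothetical table with a composite index on (column1, column2, column3)
CREATE TABLE t_demo (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        column1 INT NOT NULL,
        column2 INT NOT NULL,
        column3 INT NOT NULL,
        KEY ix_demo_123 (column1, column2, column3)
) ENGINE=InnoDB;

-- Equivalent to column1 = 3: one contiguous block of the sorted index.
-- -2147483648 and 2147483647 stand in for -Inf and +Inf on an INT column.
SELECT  *
FROM    t_demo
WHERE   (column1, column2, column3) >= (3, -2147483648, -2147483648)
        AND (column1, column2, column3) <= (3, 2147483647, 2147483647);
```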

Let's look at some real-world predicates and decide whether they are sargable (i.e. whether an index search can be applied to them). We assume that all columns are INTEGER:

Predicate                                              Lower bound     Upper bound     Sargable
column1 = 3                                            3, -Inf, -Inf   3, +Inf, +Inf   Yes
column1 = 2 AND column2 < 7                            2, -Inf, -Inf   2, 7, +Inf      Yes
column1 BETWEEN 3 AND 7
  OR (column1 = 2 AND column2 > 5)
  OR (column1 = 8 AND column2 = 4 AND column3 < 9)     2, 5, -Inf      8, 4, 9         Yes
column1 > 5 AND column2 = 8                            N/A             N/A             No
column1 = 2 AND column3 = 7                            N/A             N/A             No

If you look more closely at these predicates, you will see that the sargable ones always select contiguous blocks of records, while the non-sargable ones would have to skip some records in between.

The predicate column1 = 2 AND column3 = 7, for instance, will return (2, 3, 7) and (2, 4, 7) but will skip (2, 3, 8) which is between them.

A condition is sargable if and only if it can be represented in the following form:

ROW(@value1_start, @value2_start, …, @valueN_start) < ROW(column1, column2, …, columnN) < ROW(@value1_end, @value2_end, …, @valueN_end)

Now, returning to your table.

If you have 18 columns and want any possible combination of equality conditions in the WHERE clause to be sargable, you will have to create so many indexes that for any possible combination of your columns there is an index with exactly these columns leading.

Russian mathematician Egor Timoshenko recently proved that the minimal number of indexes satisfying this condition is the binomial coefficient C(n, ⌊n/2⌋) (see the complete proof in the original Russian publication, or in English translation).

For 18 columns, therefore, you would need C(18, 9) = 48620 indexes to cover all possible combinations.
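As a quick sanity check of that figure, the binomial coefficient for 18 columns works out as:

```latex
\binom{18}{9} = \frac{18!}{9! \cdot 9!} = 48620
```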

Bet you didn't expect this answer. Fortunately, there is some room for optimization.

The point of an index is to limit the range of examined records, so that the WHERE condition needs to be checked against fewer records.

If the whole set of conditions is sargable in the WHERE clause, no additional checking needs to be performed.

If only a part of the condition is sargable, that is not a big problem either: the optimizer will use the index for coarse filtering and then examine all records within the index range to perform the exact check.

This means that if you have some columns with high selectivity, there is no need to include other columns in the indexes that cover them.
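A sketch to illustrate the point (the table and column names here are made up): a single index on a selective column is usually enough even when the query filters on more columns, because the remaining conditions are cheap to check on the few rows the index returns.

```sql
-- Hypothetical table: customer_id is highly selective, status is not
CREATE INDEX ix_orders_customer ON orders (customer_id);

-- The index narrows the scan down to this customer's few rows;
-- the non-indexed status condition is then checked only on those rows
-- (EXPLAIN would typically show the index lookup plus "Using where")
SELECT  *
FROM    orders
WHERE   customer_id = 42
        AND status = 'shipped';
```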

Let's create a sample table and see what I mean:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_indexer (
        id INT NOT NULL PRIMARY KEY,
        col1 INT NOT NULL,
        col2 INT NOT NULL,
        col3 INT NOT NULL,
        col4 INT NOT NULL,
        col5 INT NOT NULL,
        col6 INT NOT NULL,
        value INT NOT NULL,
        KEY ix_indexer_21 (col2, col1),
        KEY ix_indexer_3 (col3),
        KEY ix_indexer_4 (col4),
        KEY ix_indexer_5 (col5),
        KEY ix_indexer_6 (col6)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_indexer
SELECT  id,
        RAND(20090509) * 5,
        RAND(20090509 << 1) * 50,
        RAND(20090509 << 2) * 500,
        RAND(20090509 << 3) * 5000,
        RAND(20090509 << 4) * 50000,
        RAND(20090509 << 5) * 500000,
        RAND(20090509 << 6) * 100
FROM    filler;

This table has 6 columns filled with random data, col1 being the least selective and col6 the most selective.

Let's query some data:

SELECT  *
FROM    t_indexer
WHERE   col2 = 6
        AND col3 = 54
        AND col4 = 8

id      col1  col2  col3  col4  col5   col6    value
264214  3     6     54    8     44144  341871  29

1 row fetched in 0.0002s (0.0053s)

Execution plan:

id:            1
select_type:   SIMPLE
table:         t_indexer
type:          index_merge
possible_keys: ix_indexer_21, ix_indexer_3, ix_indexer_4
key:           ix_indexer_4, ix_indexer_3
key_len:       4,4
rows:          1
filtered:      100.00
Extra:         Using intersect(ix_indexer_4,ix_indexer_3); Using where

This query has 3 conditions, covered by 3 different indexes. But since col3 and col4 are highly selective, the optimizer decided to use the two separate indexes on col3 and col4 and intersect the results.

Note that it didn't even bother to use the index on col2: col2 is non-selective, and it's more efficient to check it against the rows already filtered by the other, selective indexes.

For a composite index to be of use, each of the columns it consists of should be non-selective on its own, i.e. have few DISTINCT values.
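In the sample table above, this is exactly why the composite index covers the two least selective columns. A query constraining both of them can use it (a sketch; the actual plan depends on your data distribution):

```sql
-- Both equality conditions are sargable against ix_indexer_21 (col2, col1):
-- sorted by (col2, col1), the matching rows form one contiguous block
SELECT  *
FROM    t_indexer
WHERE   col1 = 3
        AND col2 = 6;
```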

Summary:

  1. Arrange your columns by selectivity, from least selective (few DISTINCT values) to most selective (many DISTINCT values)

  2. Run this query:

    SELECT  COUNT(*) / COUNT(DISTINCT col1)
    FROM    t_indexer
    
    COUNT(*) / COUNT(DISTINCT col1)
    166666.6667
    1 row fetched in 0.0001s (0.6093s)
    SELECT  COUNT(*) / COUNT(DISTINCT col1, col2)
    FROM    t_indexer
    
    COUNT(*) / COUNT(DISTINCT col1, col2)
    3267.9739
    1 row fetched in 0.0001s (0.9218s)
    SELECT  COUNT(*) / COUNT(DISTINCT col1, col2, col3)
    FROM    t_indexer
    
    COUNT(*) / COUNT(DISTINCT col1, col2, col3)
    6.5796
    1 row fetched in 0.0001s (2.0624s)

    until you start getting results less than 100

  3. For each column set that fits into the COUNT() without exceeding the limit of 100, create the following set of distinct indexes (numbers denote column positions in the selectivity order):

    • 2 columns:

      • 1, 2
      • 2
    • 3 columns:

      • 1, 2, 3
      • 2, 3
      • 3, 1
    • 4 columns (better avoid this):

      • 1, 2, 3, 4
      • 2, 3, 4
      • 3, 1, 4
      • 4, 1, 2
      • 4, 2
      • 4, 3

    Creating more indexes than that will severely affect your DML performance.

  4. For all other columns, create a single index covering only this column.

  5. Check the DML performance. If it is unacceptably poor, drop the indexes that cover the most rarely used queries, until performance becomes acceptable.
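As a concrete sketch of steps 3 and 4 (table and column names are hypothetical): suppose the selectivity ratio drops below 100 after three columns c1, c2, c3 (ordered from least to most selective), and c4 is a selective column. The 3-column pattern from step 3 then translates to:

```sql
-- Step 3: the 3-column set needs these indexes, so that every
-- subset of {c1, c2, c3} leads at least one of them
CREATE INDEX ix_t_123 ON t (c1, c2, c3);
CREATE INDEX ix_t_23  ON t (c2, c3);
CREATE INDEX ix_t_31  ON t (c3, c1);

-- Step 4: each remaining (selective) column gets its own single-column index
CREATE INDEX ix_t_4 ON t (c4);
```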

Note that the solutions above apply only to the efficiency of equality conditions. They don't cover the efficiency of ORDER BY and GROUP BY, which is another story.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

May 9th, 2009 at 11:00 pm

Posted in MySQL
