Creating indexes
Comments enabled. I *really* need your comment
Answering the questions asked on the site.
Dima asks:
I have a single rather large table with 18 columns.
Data in a table are continuous time stamped records.
On one hand, the application is appending new records to this table; on the other hand, users can do various queries.
My question is: whether I should use composite indexes or few single column indexes, or a mix of the two?
There could be all sorts of combinations in
WHERE
clause, I don't want to limit users in their choices.Is there any common sense strategy to decide on indexes in my case?
Unfortunately you forgot to mention what RDBMS you are using, so I'll assume MySQL. The principles are all the same.
Generally speaking, an index on (column1, column2, column3)
can be used for any kind of predicate that can be represented in the following form:
ROW(@value1_start, @value2_start, @value3_start) < ROW(column1, column2, column3) < ROW(@value1_end, @value2_end, @value3_end)
In other words, if all the records are sorted by column1, column2, column3
, the condition should select a contiguous block of such a sorted set of records.
All @value
's here can match and can be +Infinity
or -Infinity
.
Let's look to some real world predicates and decide whether they are sargable (i. e. an index search can be applied against them). We assume that all columns are INTEGER
:
Predicate | Lower bound | Upper bound | Sargable |
---|---|---|---|
column1 = 3 |
3, -Inf, -Inf |
3, +Inf, +Inf |
Yes |
column1 = 2 AND column2 < 7 |
2, -Inf, -Inf |
2, 7, +Inf |
Yes |
column1 BETWEEN 3 AND 7 |
2, 5, -Inf |
8, 4, 9 |
Yes |
column1 > 5 AND column2 = 8 |
N/A | N/A | No |
column1 = 2 AND column3 = 7 |
N/A | N/A | No |
If you look closer to these predicates you will see that the sargable predicates always return contiguous blocks of records, while non-sargable can skip some records.
The predicate column1 = 2 AND column3 = 7
, for instance, will return (2, 3, 7)
and (2, 4, 7)
but will skip (2, 3, 8)
which is between them.
A condition is sargable if and only if it can be represented in the following form:
ROW(@value1_start, @value2_start, …, @valueN_start) < ROW(column1, column2, …, columnN) < ROW(@value1_end, @value2_end, …, @valueN_end)
Now, returning to your table.
If you have 18 columns and want any possible combination of equality conditions in the WHERE
clause to be sargable, you will have to create so many indexes than for any possible combination of your columns there will be an index which will have these columns leading.
Russian mathematician Egor Timoshenko recently proved the theorem that the minimal number of indexes satisifying this condition is equal to C(n/2, n)
(see the complete proof in original Russian publication, or in English translation)
For 18 columns, therefore, you will need 48620 indexes to cover all possible combinations.
Bet you didn't expect this answer. Sure, there is some place for optimization.
The point of indexes is limiting the range of examined records so that WHERE
condition needs to be cheked on less records.
If the whole set of conditions is sargable in the WHERE
clause, no additional checking needs to be performed.
If only a part of the condition is sargable, it's also not a big problem: the optimizer will use the index for coarse filtering and then will just examine all records inside the index range to perform exact checking.
This means that if you have some columns with high selectivity, there is no need to include other columns into the indexes covering them.
Let's create a sample table and see what I mean:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_indexer ( id INT NOT NULL PRIMARY KEY, col1 INT NOT NULL, col2 INT NOT NULL, col3 INT NOT NULL, col4 INT NOT NULL, col5 INT NOT NULL, col6 INT NOT NULL, value INT NOT NULL, KEY ix_indexer_21 (col2, col1), KEY ix_indexer_3 (col3), KEY ix_indexer_4 (col4), KEY ix_indexer_5 (col5), KEY ix_indexer_6 (col6) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_indexer SELECT id, RAND(20090509) * 5, RAND(20090509 << 1) * 50, RAND(20090509 << 2) * 500, RAND(20090509 << 3) * 5000, RAND(20090509 << 4) * 50000, RAND(20090509 << 5) * 500000, RAND(20090509 << 6) * 100 FROM filler;
This table has 6 random-filled columns, col1
being least selective, while col6
being most selective.
Let's query some data:
SELECT * FROM t_indexer WHERE col2 = 6 AND col3 = 54 AND col4 = 8
id | col1 | col2 | col3 | col4 | col5 | col6 | value | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
264214 | 3 | 6 | 54 | 8 | 44144 | 341871 | 29 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0053s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_indexer | index_merge | ix_indexer_21,ix_indexer_3,ix_indexer_4 | ix_indexer_4,ix_indexer_3 | 4,4 | 1 | 100.00 | Using intersect(ix_indexer_4,ix_indexer_3); Using where |
This query uses 3 conditions which are covered by 3 different indexes. But since col3
and col4
are very selective, the optimizer decided to use two separate indexes on col3
and col4
and then find the intersection of the results.
Note that it didn't even bother to use an index on col2
: col2
is non-selective, and it's more efficient to examine it in the results filtered by other, selective indexes.
For a composite index to be of use, the columns it consists of should be non-selective, i. e. they should have few DISTINCT
values.
Summary:
- Arrange your columns by selectivity, from least selective (few
DISTINCT
values) to most selective (manyDISTINCT
values) - Run this query:
SELECT COUNT(*) / COUNT(DISTINCT col1) FROM t_indexer
COUNT(*) / COUNT(DISTINCT col1) 166666.6667 1 row fetched in 0.0001s (0.6093s) SELECT COUNT(*) / COUNT(DISTINCT col1, col2) FROM t_indexer
COUNT(*) / COUNT(DISTINCT col1, col2) 3267.9739 1 row fetched in 0.0001s (0.9218s) SELECT COUNT(*) / COUNT(DISTINCT col1, col2, col3) FROM t_indexer
COUNT(*) / COUNT(DISTINCT col1, col2, col3) 6.5796 1 row fetched in 0.0001s (2.0624s) until you start getting results less than 100
- For each column set that fit into the
COUNT()
without hitting the limit of 100, create a set ofDISTINCT
indexes:- 2 columns:
- 1, 2
- 2
- 3 columns:
- 1, 2, 3
- 2, 3
- 3, 1
- 4 columns (better avoid this):
- 1, 2, 3, 4
- 2, 3, 4
- 3, 1, 4
- 4, 1, 2
- 4, 2
- 4, 3
Creating more indexes will severely affect your
DML
performance - 2 columns:
- For all other columns, create a single index covering only this column.
- Check the
DML
performance. If it's unacceptably poor, get rid of indexes which cover the queries that will be used most rarely until you have acceptable performance
Note that solutions above apply only to efficiency for equality conditions. They don't cover efficiency of ORDER BY
and GROUP BY
, which is another story.
Hope that helps.
I'm always glad to answer the questions regarding database queries.