Choosing index
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table with roughly 100,000 blog postings, linked to a table with 50 feeds via an
1:n
relationship.When I query both tables with a
SELECT
statement, ordered by a datetime field of the postings table, MySQL always usesfilesort
, resulting in very slow query times (more than 1 second).SELECT `postings`.`id`, UNIX_TIMESTAMP(postings.post_date) as post_date, `postings`.`link`, `postings`.`title`, `postings`.`author`, `postings`.`excerpt`, `postings`.`long_excerpt`, `feeds`.`title` AS feed_title, `feeds`.`website` AS feed_website FROM `postings` JOIN `feeds` ON `feeds`.`id` = `postings`.`feed_id` WHERE `feeds`.`type` = 1 AND `postings`.`user_offtopic_count` < 10 AND `postings`.`is_active` = 1 ORDER BY `postings`.`post_date` desc LIMIT 15
This is a nice question that illustrates how to choose which indexes to create.
At first blush it seems to be a job for a composite index on (is_active, user_offtopic_count, post_date)
. Really, we could use the first two fields to filter on and the third field to order by.
But the problem here is that the query uses a range
condition on postings.user_offtopic_count
. In this case, the index cannot be used for ordering.
To illustrate what's going on, we will create two identical tables called t_indexer
and t_indexer2
.
In the first table (t_indexer
), active
rows comprise 0.1% of total rows, and offtopic
's are evenly distibuted from 0 to 99, thus 10% of rows have offtopic
's less than 10. Overall filter selectivity is, therefore, 0.01%, or 1 / 10000.
In the second table (t_indexer2
), active
rows comprise 80% of total rows, and rows with offtopic
less than 10 comprise 90% of rows. Filter selectivity is, therefore, 72%, or 7200 / 10000.
Both tables have 262,144 rows.
Let's see what does the composite index on (active, offtopic, timestamp)
look like.
If we filter
SELECT * FROM t_indexer FORCE INDEX (ix_indexer_active_offtopic_timestamp) WHERE active = 1 AND offtopic = 0
id | active | offtopic | timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
180750 | 1 | 0 | 2009-01-26 19:24:23 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
22055 | 1 | 0 | 2009-02-07 15:08:41 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
226297 | 1 | 0 | 2009-03-16 16:26:47 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0002s (0.0021s) |
The rows are ordered by active
, then by offtopic
, then by timestamp
, that means that within each given set of (active, offtopic)
rows are ordered by timestamp
.
MySQL knows it. And when the table is being filtered by a single set of (active, offtopic)
and then ordered by timestamp
, MySQL does not use sorting, as the values come ordered right out of the index:
EXPLAIN EXTENDED SELECT * FROM t_indexer WHERE active = 1 AND offtopic = 0 ORDER BY timestamp
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_indexer | ref | ix_indexer_active_offtopic,ix_indexer_active_timestamp,ix_indexer_active_offtopic_timestamp | ix_indexer_active_offtopic_timestamp | 5 | const,const | 3 | 100.00 | Using where; Using index | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0038s) |
Note no filesort in EXPLAIN EXTENDED
above.
But if we use a range
condition of offtopic
, then we get the following results:
SELECT * FROM t_indexer FORCE INDEX (ix_indexer_active_offtopic_timestamp) WHERE active = 1 AND offtopic <= 3
id | active | offtopic | timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
180750 | 1 | 0 | 2009-01-26 19:24:23 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
22055 | 1 | 0 | 2009-02-07 15:08:41 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
226297 | 1 | 0 | 2009-03-16 16:26:47 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
72144 | 1 | 1 | 2008-12-15 05:56:23 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
87515 | 1 | 1 | 2009-01-11 11:05:05 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
120727 | 1 | 1 | 2009-02-18 07:32:52 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
11743 | 1 | 2 | 2009-02-21 08:02:14 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
160566 | 1 | 3 | 2009-03-20 07:29:01 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0003s (0.0020s) |
As you can see, there is no more a common order of timestamp
's.
That's because timestamps are ordered only within a single chunk of (active, offtopic)
and with a ranged filtering condition several chunks are selected. The order of timestamps is not preserved between the chunks and the rows should be sorted:
EXPLAIN EXTENDED SELECT * FROM t_indexer FORCE INDEX (ix_indexer_active_offtopic_timestamp) WHERE active = 1 AND offtopic <= 3 ORDER BY timestamp
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_indexer | range | ix_indexer_active_offtopic_timestamp | ix_indexer_active_offtopic_timestamp | 5 | 8 | 100.00 | Using where; Using index; Using filesort | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0006s (0.0028s) |
So what can we do in this case?
With a ranged condition, we can benefit either from fast filtering or from fast ordering, but not from both.
A certain index should be created for each of these cases:
KEY ix_indexer_active_offtopic (active, offtopic)
This one can be used for fast filtering on a
range
condition.KEY ix_indexer_active_timestamp (active, timestamp)
This one keeps overall ordering on
timestamp
and can be used for fast ordering.
Which one is faster, depends on the following parameters:
- Filter selectivity.
Filtering out
n
rows takes constant time if no index restrictions are used, and justΘ(log(n))
if an index is used. - Sorting complexity.
Sorting takes
Θ(n log(n))
time if performed in memory and more if disk access is used.Note, though, that
filesort
in most cases has nothing to do with files: the files are beign used only if there is not enough memory.For out task (200,000 rows at most), it will most probably fit into memory and no disk access will be required.
Now, let's check which index is better for which table:
SELECT SUM(id) FROM ( SELECT id FROM t_indexer FORCE INDEX (ix_indexer_active_offtopic) WHERE active = 1 AND offtopic < 10 ORDER BY timestamp ) q
SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4304958 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.0016s) |
SELECT SUM(id) FROM ( SELECT id FROM t_indexer FORCE INDEX (ix_indexer_active_timestamp) WHERE active = 1 AND offtopic < 10 ORDER BY timestamp ) q
SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4304958 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.0047s) |
It can be easily seen that on this very selective condition, we benefit from range
access more than from absence of filesort
.
Let's see what's for the other table:
SELECT SUM(id) FROM ( SELECT id FROM t_indexer2 FORCE INDEX (ix_indexer_active_timestamp) WHERE active = 1 AND offtopic < 10 ORDER BY timestamp ) q
SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25048211173 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7375s) |
SELECT SUM(id) FROM ( SELECT id FROM t_indexer2 FORCE INDEX (ix_indexer_active_offtopic) WHERE active = 1 AND offtopic < 10 ORDER BY timestamp ) q
SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25048211173 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (1.0156s) |
Now, the condition is not selective, and there are lots of rows to sort.
In this case, we benefit from having preordered data more than from the range
access.
Summary
In a query with a ranged condition on one field and ordering by other field:
- If your range condition is selective (few rows satisfy this condition), it's better to create an index on the filtering field.
- If your range condition is not selective (lots of rows satisfy this condition), it's better to create an index on the ordering field.