EXPLAIN EXTENDED

How to create fast database queries

Choosing index

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table with roughly 100,000 blog postings, linked to a table with 50 feeds via an 1:n relationship.

When I query both tables with a SELECT statement, ordered by a datetime field of the postings table, MySQL always uses filesort, resulting in very slow query times (more than 1 second).

SELECT
`postings`.`id`,
UNIX_TIMESTAMP(postings.post_date) as post_date,
`postings`.`link`,
`postings`.`title`,
`postings`.`author`,
`postings`.`excerpt`,
`postings`.`long_excerpt`,
`feeds`.`title` AS feed_title,
`feeds`.`website` AS feed_website
FROM
`postings`
JOIN
`feeds`
ON
`feeds`.`id` = `postings`.`feed_id`
WHERE
`feeds`.`type` = 1 AND
`postings`.`user_offtopic_count` < 10 AND 
    `postings`.`is_active` = 1
ORDER BY 
    `postings`.`post_date` desc
LIMIT 
    15

This is a nice question that illustrates how to choose which indexes to create.

At first blush it seems to be a job for a composite index on (is_active, user_offtopic_count, post_date). Really, we could use the first two fields to filter on and the third field to order by.

But the problem here is that the query uses a range condition on postings.user_offtopic_count. In this case, the index cannot be used for ordering.

To illustrate what's going on, we will create two identical tables called t_indexer and t_indexer2.

In the first table (t_indexer), active rows comprise 0.1% of total rows, and offtopic's are evenly distibuted from 0 to 99, thus 10% of rows have offtopic's less than 10. Overall filter selectivity is, therefore, 0.01%, or 1 / 10000.

In the second table (t_indexer2), active rows comprise 80% of total rows, and rows with offtopic less than 10 comprise 90% of rows. Filter selectivity is, therefore, 72%, or 7200 / 10000.

Both tables have 262,144 rows.

Table creation details

Let's see what does the composite index on (active, offtopic, timestamp) look like.

If we filter

SELECT  *
FROM    t_indexer FORCE INDEX (ix_indexer_active_offtopic_timestamp)
WHERE   active = 1
AND offtopic = 0
id active offtopic timestamp
180750 1 0 2009-01-26 19:24:23
22055 1 0 2009-02-07 15:08:41
226297 1 0 2009-03-16 16:26:47
3 rows fetched in 0.0002s (0.0021s)

The rows are ordered by active, then by offtopic, then by timestamp, that means that within each given set of (active, offtopic) rows are ordered by timestamp.

MySQL knows it. And when the table is being filtered by a single set of (active, offtopic) and then ordered by timestamp, MySQL does not use sorting, as the values come ordered right out of the index:

EXPLAIN EXTENDED
SELECT  *
FROM    t_indexer
WHERE   active = 1
AND offtopic = 0
ORDER BY
timestamp
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_indexer ref ix_indexer_active_offtopic,ix_indexer_active_timestamp,ix_indexer_active_offtopic_timestamp ix_indexer_active_offtopic_timestamp 5 const,const 3 100.00 Using where; Using index
1 row fetched in 0.0002s (0.0038s)

Note no filesort in EXPLAIN EXTENDED above.

But if we use a range condition of offtopic, then we get the following results:

SELECT  *
FROM    t_indexer FORCE INDEX (ix_indexer_active_offtopic_timestamp)
WHERE   active = 1
AND offtopic <= 3

id active offtopic timestamp
180750 1 0 2009-01-26 19:24:23
22055 1 0 2009-02-07 15:08:41
226297 1 0 2009-03-16 16:26:47
72144 1 1 2008-12-15 05:56:23
87515 1 1 2009-01-11 11:05:05
120727 1 1 2009-02-18 07:32:52
11743 1 2 2009-02-21 08:02:14
160566 1 3 2009-03-20 07:29:01
8 rows fetched in 0.0003s (0.0020s)

As you can see, there is no more a common order of timestamp's.

That's because timestamps are ordered only within a single chunk of (active, offtopic) and with a ranged filtering condition several chunks are selected. The order of timestamps is not preserved between the chunks and the rows should be sorted:

EXPLAIN EXTENDED
SELECT  *
FROM    t_indexer FORCE INDEX (ix_indexer_active_offtopic_timestamp)
WHERE   active = 1
AND offtopic <= 3
ORDER BY
        timestamp

id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_indexer range ix_indexer_active_offtopic_timestamp ix_indexer_active_offtopic_timestamp 5 8 100.00 Using where; Using index; Using filesort
1 row fetched in 0.0006s (0.0028s)

So what can we do in this case?

With a ranged condition, we can benefit either from fast filtering or from fast ordering, but not from both.

A certain index should be created for each of these cases:

  1. KEY ix_indexer_active_offtopic (active, offtopic)

    This one can be used for fast filtering on a range condition.

  2. KEY ix_indexer_active_timestamp (active, timestamp)

    This one keeps overall ordering on timestamp and can be used for fast ordering.

Which one is faster, depends on the following parameters:

  1. Filter selectivity.

    Filtering out n rows takes constant time if no index restrictions are used, and just Θ(log(n)) if an index is used.

  2. Sorting complexity.

    Sorting takes Θ(n log(n)) time if performed in memory and more if disk access is used.

    Note, though, that filesort in most cases has nothing to do with files: the files are beign used only if there is not enough memory.

    For out task (200,000 rows at most), it will most probably fit into memory and no disk access will be required.

Now, let's check which index is better for which table:

SELECT  SUM(id)
FROM    (
SELECT  id
FROM    t_indexer FORCE INDEX (ix_indexer_active_offtopic)
WHERE   active = 1
AND offtopic < 10
        ORDER BY
                timestamp
        ) q
SUM(id)
4304958
1 row fetched in 0.0001s (0.0016s)
SELECT  SUM(id)
FROM    (
SELECT  id
FROM    t_indexer FORCE INDEX (ix_indexer_active_timestamp)
WHERE   active = 1
AND offtopic < 10
        ORDER BY
                timestamp
        ) q
SUM(id)
4304958
1 row fetched in 0.0001s (0.0047s)

It can be easily seen that on this very selective condition, we benefit from range access more than from absence of filesort.

Let's see what's for the other table:

SELECT  SUM(id)
FROM    (
SELECT  id
FROM    t_indexer2 FORCE INDEX (ix_indexer_active_timestamp)
WHERE   active = 1
AND offtopic < 10
        ORDER BY
                timestamp
        ) q
SUM(id)
25048211173
1 row fetched in 0.0001s (0.7375s)
SELECT  SUM(id)
FROM    (
SELECT  id
FROM    t_indexer2 FORCE INDEX (ix_indexer_active_offtopic)
WHERE   active = 1
AND offtopic < 10
        ORDER BY
                timestamp
        ) q
SUM(id)
25048211173
1 row fetched in 0.0001s (1.0156s)

Now, the condition is not selective, and there are lots of rows to sort.

In this case, we benefit from having preordered data more than from the range access.

Summary

In a query with a ranged condition on one field and ordering by other field:

  • If your range condition is selective (few rows satisfy this condition), it's better to create an index on the filtering field.
  • If your range condition is not selective (lots of rows satisfy this condition), it's better to create an index on the ordering field.

Written by Quassnoi

April 1st, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply