EXPLAIN EXTENDED

How to create fast database queries

Archive for November 10th, 2009

MySQL: average of sums using indexes

Comments enabled. I *really* need your comment

From Stack Overflow:

My table looks like:

person_id car_id miles
1 1 100
1 2 200
2 3 1000
2 4 500

I need to total the miles for each person and then average those totals.

This is a good question to demonstrate how to improve MySQL queries using covering indexes.

The query should return a single value, which is SUM(miles) / COUNT(DISTINCT person_id). Since this query does not relate the fields to each other (only total sum and total count matter), it could be performed using only the indexes on miles and person_id. However, MySQL is not able to do this by default and it requires a little help from developer's side.

Let's create a sample table:

Table creation details

This table has 1,000,000 rows with 1,000 distinct person_id's. There are indexes on miles and person_id.

A plain query would look like this:

SELECT  SUM(miles) / COUNT(DISTINCT person_id) AS total
FROM    persons
total
4998778.6000
1 row fetched in 0.0001s (2.1468s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE persons ALL 1000066 100.00
select (sum(`20091110_indexes`.`persons`.`miles`) / count(distinct `20091110_indexes`.`persons`.`person_id`)) AS `total` from `20091110_indexes`.`persons`

The query takes more than 2 seconds.

If we look into the query plan we will see that a full table scan is used. The query needs both the values of person_id and miles and no index contains both values. MySQL browses all table records which are quite large.

But despite the fact that the query needs both values, it does not need to correspond the values. It may calculate the SUM first (using an index on miles), and COUNT(DISTINCT person_id) in another query. Since index records are much less in size that the table records, the separate aggregates on them will be calculated much faster.

And we can improve the query even more by replacing COUNT(DISTINCT person_id) with a COUNT over DISTINCT person_id in a subquery. The latter query will use INDEX FOR GROUP BY and therefore be almost instant.

Here's how the new improved query would look like:

SELECT  SUM(miles) /
(
SELECT  COUNT(*)
FROM    (
SELECT  DISTINCT person_id
FROM    persons
) q
) AS total
FROM    persons
total
4998778.6000
1 row fetched in 0.0001s (0.7344s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY persons index ix_persons_miles 4 1000066 100.00 Using index
2 SUBQUERY Select tables optimized away
3 DERIVED persons range ix_persons_person 4 2021 100.00 Using index for group-by
select (sum(`20091110_indexes`.`persons`.`miles`) / (select count(0) AS `COUNT(*)` from (select distinct `20091110_indexes`.`persons`.`person_id` AS `person_id` from `20091110_indexes`.`persons`) `q`)) AS `total` from `20091110_indexes`.`persons`

This query uses separate indexes to calculate the aggregate values, and by using INDEX FOR GROUP-BY, COUNT(DISTINCT person_id) is calculated almost instantly.

Overall query performance is 750 ms, or thrice as fast as the original query.

Written by Quassnoi

November 10th, 2009 at 11:00 pm

Posted in MySQL