Archive for November 10th, 2009
MySQL: average of sums using indexes
Comments enabled. I *really* need your comment
From Stack Overflow:
My table looks like:
person_id car_id miles 1 1 100 1 2 200 2 3 1000 2 4 500 I need to total the miles for each person and then average those totals.
This is a good question to demonstrate how to improve MySQL queries using covering indexes.
The query should return a single value, which is SUM(miles) / COUNT(DISTINCT person_id)
. Since this query does not relate the fields to each other (only total sum and total count matter), it could be performed using only the indexes on miles
and person_id
. However, MySQL is not able to do this by default and it requires a little help from developer's side.
Let's create a sample table:
This table has 1,000,000 rows with 1,000 distinct person_id
's. There are indexes on miles
and person_id
.
A plain query would look like this:
SELECT SUM(miles) / COUNT(DISTINCT person_id) AS total FROM persons
total | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4998778.6000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (2.1468s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | persons | ALL | 1000066 | 100.00 |
select (sum(`20091110_indexes`.`persons`.`miles`) / count(distinct `20091110_indexes`.`persons`.`person_id`)) AS `total` from `20091110_indexes`.`persons`
The query takes more than 2 seconds.
If we look into the query plan we will see that a full table scan is used. The query needs both the values of person_id
and miles
and no index contains both values. MySQL browses all table records which are quite large.
But despite the fact that the query needs both values, it does not need to correspond the values. It may calculate the SUM
first (using an index on miles
), and COUNT(DISTINCT person_id)
in another query. Since index records are much less in size that the table records, the separate aggregates on them will be calculated much faster.
And we can improve the query even more by replacing COUNT(DISTINCT person_id)
with a COUNT
over DISTINCT person_id
in a subquery. The latter query will use INDEX FOR GROUP BY
and therefore be almost instant.
Here's how the new improved query would look like:
SELECT SUM(miles) / ( SELECT COUNT(*) FROM ( SELECT DISTINCT person_id FROM persons ) q ) AS total FROM persons
total | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4998778.6000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7344s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | persons | index | ix_persons_miles | 4 | 1000066 | 100.00 | Using index | ||
2 | SUBQUERY | Select tables optimized away | ||||||||
3 | DERIVED | persons | range | ix_persons_person | 4 | 2021 | 100.00 | Using index for group-by |
select (sum(`20091110_indexes`.`persons`.`miles`) / (select count(0) AS `COUNT(*)` from (select distinct `20091110_indexes`.`persons`.`person_id` AS `person_id` from `20091110_indexes`.`persons`) `q`)) AS `total` from `20091110_indexes`.`persons`
This query uses separate indexes to calculate the aggregate values, and by using INDEX FOR GROUP-BY
, COUNT(DISTINCT person_id)
is calculated almost instantly.
Overall query performance is 750 ms, or thrice as fast as the original query.