MySQL: splitting aggregate queries
Answering questions asked on the site.
Victor asks:
I have a table which I will call
sale
to protect the innocent:
Sale id product price amount date I need to retrieve ultimate values of
price
,amount
anddate
for each product:SELECT product, MIN(price), MAX(price), MIN(amount), MAX(amount), MIN(date), MAX(date) FROM sale GROUP BY productThe query only returns about 100 records.
I have all these fields indexed (together with
product
), but this still produces a full table scan over 3,000,000 records.How do I speed up this query?
To retrieve the ultimate values of the fields, MySQL would just need to make a loose index scan over each index and take the max and min value of the field for each product
.
However, the optimizer won't do when multiple indexes are involved. Instead, it will revert to a full scan.
There is a workaround for it. Let's create a sample table and see them:
This table contains 3M random records for 100 distinct products and is indexed appropriately.
Here's a straightforward query:
SELECT product, MIN(price), MAX(price), MIN(amount), MAX(amount), MIN(dt), MAX(dt) FROM sale GROUP BY product
product | MIN(price) | MAX(price) | MIN(amount) | MAX(amount) | MIN(dt) | MAX(dt) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 20.01 | 120.00 | 101 | 1100 | 2010-12-01 06:14:22 | 2011-03-26 23:42:32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 20.01 | 120.00 | 101 | 1100 | 2010-12-01 06:39:38 | 2011-03-26 23:58:25 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | 20.01 | 120.00 | 101 | 1100 | 2010-12-01 06:13:34 | 2011-03-26 23:54:32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | 20.01 | 120.00 | 101 | 1100 | 2010-12-01 06:14:30 | 2011-03-26 23:58:45 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100 | 20.01 | 120.00 | 101 | 1100 | 2010-12-01 06:32:58 | 2011-03-26 23:58:24 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100 rows fetched in 0.0052s (13.8279s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | sale | index | ix_sale_product_amount | 8 | 3000409 | 100.00 |
select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`price`) AS `MIN(price)`,max(`20110327_split`.`sale`.`price`) AS `MAX(price)`,min(`20110327_split`.`sale`.`amount`) AS `MIN(amount)`,max(`20110327_split`.`sale`.`amount`) AS `MAX(amount)`,min(`20110327_split`.`sale`.`dt`) AS `MIN(dt)`,max(`20110327_split`.`sale`.`dt`) AS `MAX(dt)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`
As was expected, the query uses a full table scan (or, rather, an index scan) and takes almost 14 seconds to complete.
In order to make the engine use three separate indexes, we would need to split the query into three queries each searching for ultimate values on one column, and then combine the results.
Each of these queries would be a MIN / MAX
query on a trailing column of a composite index, combined with a GROUP BY
on the index's leading columns, and, as such, would be subject to loose index scan optimization.
To combine the results, we will of course just join them on product
. This will be quite efficient too since the resultsets are going to be quite small (100 records each, exactly).
And here is the query:
SELECT * FROM ( SELECT product, MIN(amount), MAX(amount) FROM sale GROUP BY product ) qa NATURAL JOIN ( SELECT product, MIN(price), MAX(price) FROM sale GROUP BY product ) qp NATURAL JOIN ( SELECT product, MIN(dt), MAX(dt) FROM sale GROUP BY product ) qd
product | MIN(amount) | MAX(amount) | MIN(price) | MAX(price) | MIN(dt) | MAX(dt) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 101 | 1100 | 20.01 | 120.00 | 2010-12-01 06:14:22 | 2011-03-26 23:42:32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 101 | 1100 | 20.01 | 120.00 | 2010-12-01 06:39:38 | 2011-03-26 23:58:25 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | 101 | 1100 | 20.01 | 120.00 | 2010-12-01 06:13:34 | 2011-03-26 23:54:32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | 101 | 1100 | 20.01 | 120.00 | 2010-12-01 06:14:30 | 2011-03-26 23:58:45 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100 | 101 | 1100 | 20.01 | 120.00 | 2010-12-01 06:32:58 | 2011-03-26 23:58:24 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100 rows fetched in 0.0053s (0.0105s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 100 | 100.00 | |||||
1 | PRIMARY | <derived3> | ALL | 100 | 100.00 | Using where; Using join buffer | ||||
1 | PRIMARY | <derived4> | ALL | 100 | 100.00 | Using where; Using join buffer | ||||
4 | DERIVED | sale | range | ix_sale_product_dt | 4 | 19 | 100.00 | Using index for group-by | ||
3 | DERIVED | sale | range | ix_sale_product_price | 4 | 19 | 100.00 | Using index for group-by | ||
2 | DERIVED | sale | range | ix_sale_product_amount | 4 | 1067 | 100.00 | Using index for group-by |
select `qa`.`product` AS `product`,`qa`.`MIN(amount)` AS `MIN(amount)`,`qa`.`MAX(amount)` AS `MAX(amount)`,`qp`.`MIN(price)` AS `MIN(price)`,`qp`.`MAX(price)` AS `MAX(price)`,`qd`.`MIN(dt)` AS `MIN(dt)`,`qd`.`MAX(dt)` AS `MAX(dt)` from (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`amount`) AS `MIN(amount)`,max(`20110327_split`.`sale`.`amount`) AS `MAX(amount)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qa` join (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`price`) AS `MIN(price)`,max(`20110327_split`.`sale`.`price`) AS `MAX(price)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qp` join (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`dt`) AS `MIN(dt)`,max(`20110327_split`.`sale`.`dt`) AS `MAX(dt)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qd` where ((`qp`.`product` = `qa`.`product`) and (`qd`.`product` = `qa`.`product`))
Each individual query is executed with using index for group-by
, that is jumping over the min and max values in a loose index scan.
The queries are combined in a join which, despite not using the indexes, is still very fast because only 100 records are joined (of course fitting into a join buffer).
The overall query time is only 10 ms.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Learning quite a few things while surfing your page.
You’re an SQL-Wizzard ^^
Keep up the good work!
sneils
18 Jun 12 at 12:55
I did know this technique, but I’m glad I read your post, because now I’m very interested in the methods you use to populate test tables. I hope you’ll post about this topic in the future.
Federico
11 Jan 13 at 15:21