MySQL: splitting aggregate queries at EXPLAIN EXTENDED

MySQL: splitting aggregate queries

Answering questions asked on the site.

Victor asks:

I have a table which I will call sale to protect the innocent:

Sale

id product price amount date

I need to retrieve ultimate values of price, amount and date for each product:
SELECT  product,
MIN(price), MAX(price),
MIN(amount), MAX(amount),
MIN(date), MAX(date)
FROM    sale
GROUP BY
product
The query only returns about 100 records.

I have all these fields indexed (together with product), but this still produces a full table scan over 3,000,000 records.

How do I speed up this query?

To retrieve the ultimate values of the fields, MySQL would just need to make a loose index scan over each index and take the max and min value of the field for each product.

However, the optimizer won't do when multiple indexes are involved. Instead, it will revert to a full scan.

There is a workaround for it. Let's create a sample table and see them:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE sale
(
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
product INT NOT NULL,
amount INT NOT NULL,
price NUMERIC(20, 2) NOT NULL,
dt DATETIME NOT NULL,
KEY ix_sale_product_amount (product, amount),
KEY ix_sale_product_price (product, price),
KEY ix_sale_product_dt (product, dt)
)
ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(50000);
COMMIT;

INSERT
INTO    sale (product, amount, price, dt)
SELECT  CEILING(RAND(20110227) * 100),
        CEILING(RAND(20110227 &lt;&lt; 1) * 1000) + 100,
        CEILING(RAND(20110227 &lt;&lt; 1 + 1) * 10000) / 100.00 + 20.00,
        '2011-03-27' - INTERVAL CEILING(RAND(20110227 &lt;&lt; 1 + 2) * 10000000) SECOND
FROM    filler
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        ORDER BY
                id
        LIMIT 60
        ) q;

This table contains 3M random records for 100 distinct products and is indexed appropriately.

Here's a straightforward query:

SELECT  product,
MIN(price), MAX(price),
MIN(amount), MAX(amount),
MIN(dt), MAX(dt)
FROM    sale
GROUP BY
product

product	MIN(price)	MAX(price)	MIN(amount)	MAX(amount)	MIN(dt)	MAX(dt)
1	20.01	120.00	101	1100	2010-12-01 06:14:22	2011-03-26 23:42:32
2	20.01	120.00	101	1100	2010-12-01 06:39:38	2011-03-26 23:58:25
3	20.01	120.00	101	1100	2010-12-01 06:13:34	2011-03-26 23:54:32
4	20.01	120.00	101	1100	2010-12-01 06:14:30	2011-03-26 23:58:45

100	20.01	120.00	101	1100	2010-12-01 06:32:58	2011-03-26 23:58:24
100 rows fetched in 0.0052s (13.8279s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	sale	index		ix_sale_product_amount	8		3000409	100.00

select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`price`) AS `MIN(price)`,max(`20110327_split`.`sale`.`price`) AS `MAX(price)`,min(`20110327_split`.`sale`.`amount`) AS `MIN(amount)`,max(`20110327_split`.`sale`.`amount`) AS `MAX(amount)`,min(`20110327_split`.`sale`.`dt`) AS `MIN(dt)`,max(`20110327_split`.`sale`.`dt`) AS `MAX(dt)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`

As was expected, the query uses a full table scan (or, rather, an index scan) and takes almost 14 seconds to complete.

In order to make the engine use three separate indexes, we would need to split the query into three queries each searching for ultimate values on one column, and then combine the results.

Each of these queries would be a MIN / MAX query on a trailing column of a composite index, combined with a GROUP BY on the index's leading columns, and, as such, would be subject to loose index scan optimization.

To combine the results, we will of course just join them on product. This will be quite efficient too since the resultsets are going to be quite small (100 records each, exactly).

And here is the query:

SELECT  *
FROM    (
SELECT  product, MIN(amount), MAX(amount)
FROM    sale
GROUP BY
product
) qa
NATURAL JOIN
(
SELECT  product, MIN(price), MAX(price)
FROM    sale
GROUP BY
product
) qp
NATURAL JOIN
(
SELECT  product, MIN(dt), MAX(dt)
FROM    sale
GROUP BY
product
) qd

product	MIN(amount)	MAX(amount)	MIN(price)	MAX(price)	MIN(dt)	MAX(dt)
1	101	1100	20.01	120.00	2010-12-01 06:14:22	2011-03-26 23:42:32
2	101	1100	20.01	120.00	2010-12-01 06:39:38	2011-03-26 23:58:25
3	101	1100	20.01	120.00	2010-12-01 06:13:34	2011-03-26 23:54:32
4	101	1100	20.01	120.00	2010-12-01 06:14:30	2011-03-26 23:58:45

100	101	1100	20.01	120.00	2010-12-01 06:32:58	2011-03-26 23:58:24
100 rows fetched in 0.0053s (0.0105s)

id	select_type	table	type	key	key_len	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL			100	100.00
1	PRIMARY	<derived3>	ALL			100	100.00	Using where; Using join buffer
1	PRIMARY	<derived4>	ALL			100	100.00	Using where; Using join buffer
4	DERIVED	sale	range	ix_sale_product_dt	4	19	100.00	Using index for group-by
3	DERIVED	sale	range	ix_sale_product_price	4	19	100.00	Using index for group-by
2	DERIVED	sale	range	ix_sale_product_amount	4	1067	100.00	Using index for group-by

select `qa`.`product` AS `product`,`qa`.`MIN(amount)` AS `MIN(amount)`,`qa`.`MAX(amount)` AS `MAX(amount)`,`qp`.`MIN(price)` AS `MIN(price)`,`qp`.`MAX(price)` AS `MAX(price)`,`qd`.`MIN(dt)` AS `MIN(dt)`,`qd`.`MAX(dt)` AS `MAX(dt)` from (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`amount`) AS `MIN(amount)`,max(`20110327_split`.`sale`.`amount`) AS `MAX(amount)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qa` join (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`price`) AS `MIN(price)`,max(`20110327_split`.`sale`.`price`) AS `MAX(price)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qp` join (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`dt`) AS `MIN(dt)`,max(`20110327_split`.`sale`.`dt`) AS `MAX(dt)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qd` where ((`qp`.`product` = `qa`.`product`) and (`qd`.`product` = `qa`.`product`))

Each individual query is executed with using index for group-by, that is jumping over the min and max values in a loose index scan.

The queries are combined in a join which, despite not using the indexes, is still very fast because only 100 records are joined (of course fitting into a join buffer).

The overall query time is only 10 ms.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

March 28th, 2011 at 11:00 pm

Posted in MySQL

2 Responses to 'MySQL: splitting aggregate queries'

Subscribe to comments with RSS

Learning quite a few things while surfing your page.
You’re an SQL-Wizzard ^^
Keep up the good work!

sneils

18 Jun 12 at 12:55
I did know this technique, but I’m glad I read your post, because now I’m very interested in the methods you use to populate test tables. I hope you’ll post about this topic in the future.

Federico

11 Jan 13 at 15:21

EXPLAIN EXTENDED