From Stack Overflow:
I have a table product
with products and table sale
with all sale operations that were done on these products.
I would like to get 10 most often sold products today and what I did is this:
SELECT p.*, COUNT(s.id) AS sumsell
FROM product p
LEFT JOIN
sale s
ON s.product_id = p.id
AND s.dt >= '2010-01-01'
AND s.dt < '2010-01-02'
GROUP BY
p.id
ORDER BY
sumsell DESC
LIMIT 10
, but performance of it is very slow.
What can I do to increase performance of this particular query?
The query involves a LEFT JOIN
which in MySQL world means that products
will be made leading in the query. Each record of product
will be taken and checked against sale
table to find out the number of matching records. If no matching records are found, 0 is returned.
Let's create the sample tables:
Table creation details
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
DELIMITER $$
CREATE TABLE product (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE sale (
id INT NOT NULL PRIMARY KEY,
product_id INT NOT NULL,
amount FLOAT NOT NULL,
dt DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(500000);
COMMIT;
INSERT
INTO product
SELECT id, CONCAT('Product ', id)
FROM filler;
INSERT
INTO sale (id, product_id, amount, dt)
SELECT id,
FLOOR(RAND(20100305) * 500000) + 1,
RAND(20100305 << 1) * 100 + 1,
'2010-03-06' - INTERVAL id MINUTE
FROM filler;
CREATE INDEX ix_sale_product_dt ON sale (product_id, dt);
CREATE INDEX ix_sale_dt_product ON sale (dt, product_id);
The table contains 500,000 products and 500,000 random sales (1,440 sales per day).
Now, let's run the query similar to the author's one. I adjusted the period so that fewer than 10 actual sales were made during the period and LEFT JOIN
records can be seen in the table:
SELECT p.*, COUNT(s.id) AS sumsell
FROM product p
LEFT JOIN
sale s
ON s.product_id = p.id
AND s.dt >= '2010-01-01'
AND s.dt < '2010-01-01 00:07:00'
GROUP BY
p.id
ORDER BY
sumsell DESC, p.id
LIMIT 10
View query details
id |
name |
sumsell |
50630 |
Product 50630 |
1 |
143395 |
Product 143395 |
1 |
222114 |
Product 222114 |
1 |
322966 |
Product 322966 |
1 |
340133 |
Product 340133 |
1 |
345789 |
Product 345789 |
1 |
462937 |
Product 462937 |
1 |
1 |
Product 1 |
0 |
2 |
Product 2 |
0 |
3 |
Product 3 |
0 |
10 rows fetched in 0.0009s (3.0312s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
SIMPLE |
p |
index |
|
PRIMARY |
4 |
|
10 |
5007270.00 |
Using temporary; Using filesort |
1 |
SIMPLE |
s |
ref |
ix_sale_product_dt,ix_sale_dt_product |
ix_sale_product_dt |
4 |
20100305_left.p.id |
1 |
100.00 |
Using index |
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,count(`20100305_left`.`s`.`id`) AS `sumsell` from `20100305_left`.`product` `p` left join `20100305_left`.`sale` `s` on(((`20100305_left`.`s`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`s`.`dt` >= '2010-01-01') and (`20100305_left`.`s`.`dt` < '2010-01-01 00:07:00'))) where 1 group by `20100305_left`.`p`.`id` order by count(`20100305_left`.`s`.`id`) desc,`20100305_left`.`p`.`id` limit 10
The query runs for 3 seconds.
We see that, first, product
is made leading, and, second, only a part of the index on sale (product, dt)
is used: each sale is only filtered on product, not on date.
Since there were only 7 sales during the period we have chosen, it would be a wise decision to make sale
leading in the join so that it could be filtered on date and the resulting recordset was then joined to product
. This would result in at most 7 PRIMARY KEY
seeks instead of 500,000 range scans and would be much more efficient.
However, this is only possible with the INNER JOIN
, and if there are less then 10 products sold within the time period, we will not see the rest.
To work around this, we need to emulate the LEFT JOIN
:
- Find the products sold within the time period, using an
INNER JOIN
of product
with the resultset containg aggregated sales.
- Find the products not sold within the time period, using
NOT EXISTS
predicate.
- Concatenate the two resultsets using
UNION ALL
.
The step 2 implies that product
is leading again, so normally it would not be much of improvement. But in our case, we don't need the whole recordset, we only need the top 10 sales.
So we can just order and limit the recordsets retrieved on steps 1 and 2 to ten records each, concatenate them, then order and limit the resulting recordset again to ten records.
The second resultset will contain a hardcoded 0 in the sumsell
, so we just need to order it on product.id
. Since product
is an InnoDB table and product.id
is a clustered PRIMARY KEY
, this is not a problem.
Here's the query:
SELECT p.*, sumsell
FROM (
SELECT *
FROM (
SELECT product_id, sumsell
FROM (
SELECT product_id, COUNT(*) AS sumsell
FROM sale si
WHERE dt >= '2010-01-01'
AND dt < '2010-01-01 00:07:00'
GROUP BY
product_id
) si
ORDER BY
sumsell DESC, product_id
LIMIT 10
) q1
UNION ALL
SELECT *
FROM (
SELECT p.id, 0
FROM product p
WHERE NOT EXISTS
(
SELECT NULL
FROM sale si
WHERE product_id = p.id
AND dt >= '2010-01-01'
AND dt < '2010-01-01 00:07:00'
)
ORDER BY
p.id
LIMIT 10
) q2
ORDER BY
sumsell DESC, product_id
LIMIT 10
) q
JOIN product p
ON p.id = q.product_id
View query details
id |
name |
sumsell |
50630 |
Product 50630 |
1 |
143395 |
Product 143395 |
1 |
222114 |
Product 222114 |
1 |
322966 |
Product 322966 |
1 |
340133 |
Product 340133 |
1 |
345789 |
Product 345789 |
1 |
462937 |
Product 462937 |
1 |
1 |
Product 1 |
0 |
2 |
Product 2 |
0 |
3 |
Product 3 |
0 |
10 rows fetched in 0.0012s (0.0064s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
<derived2> |
ALL |
|
|
|
|
10 |
100.00 |
|
1 |
PRIMARY |
p |
eq_ref |
PRIMARY |
PRIMARY |
4 |
q.product_id |
1 |
100.00 |
|
2 |
DERIVED |
<derived3> |
ALL |
|
|
|
|
7 |
100.00 |
|
3 |
DERIVED |
<derived4> |
ALL |
|
|
|
|
7 |
100.00 |
Using filesort |
4 |
DERIVED |
si |
range |
ix_sale_dt_product |
ix_sale_dt_product |
8 |
|
6 |
100.00 |
Using where; Using index; Using temporary; Using filesort |
5 |
UNION |
<derived6> |
ALL |
|
|
|
|
10 |
100.00 |
|
6 |
DERIVED |
p |
index |
|
PRIMARY |
4 |
|
10 |
5007270.00 |
Using where; Using index |
7 |
DEPENDENT SUBQUERY |
si |
ref |
ix_sale_product_dt,ix_sale_dt_product |
ix_sale_product_dt |
4 |
20100305_left.p.id |
1 |
100.00 |
Using where; Using index |
|
UNION RESULT |
<union2,5> |
ALL |
|
|
|
|
|
|
Using filesort |
Field or reference '20100305_left.p.id' of SELECT #7 was resolved in SELECT #6
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,`q`.`sumsell` AS `sumsell` from (select `q1`.`product_id` AS `product_id`,`q1`.`sumsell` AS `sumsell` from (select `si`.`product_id` AS `product_id`,`si`.`sumsell` AS `sumsell` from (select `20100305_left`.`si`.`product_id` AS `product_id`,count(0) AS `sumsell` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`dt` >= '2010-01-01') and (`20100305_left`.`si`.`dt` < '2010-01-01 00:07:00')) group by `20100305_left`.`si`.`product_id`) `si` order by `si`.`sumsell` desc,`si`.`product_id` limit 10) `q1` union all select `q2`.`id` AS `id`,`q2`.`0` AS `0` from (select `20100305_left`.`p`.`id` AS `id`,0 AS `0` from `20100305_left`.`product` `p` where (not(exists(select NULL AS `NULL` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`si`.`dt` >= '2010-01-01') and (`20100305_left`.`si`.`dt` < '2010-01-01 00:07:00'))))) order by `20100305_left`.`p`.`id` limit 10) `q2` order by `sumsell` desc,`product_id` limit 10) `q` join `20100305_left`.`product` `p` where (`20100305_left`.`p`.`id` = `q`.`product_id`)
This query completes in less than 7 ms (which is comparable to the time measurement error).