Aggregates and LEFT JOIN at EXPLAIN EXTENDED

I have a table product with products and table sale with all sale operations that were done on these products.

I would like to get 10 most often sold products today and what I did is this:
SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
sale s
ON      s.product_id = p.id
AND s.dt &gt;= '2010-01-01'
AND s.dt &lt; '2010-01-02'
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10
, but performance of it is very slow.

What can I do to increase performance of this particular query?

The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.

Let's create the sample tables:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE TABLE product (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE sale (
id INT NOT NULL PRIMARY KEY,
product_id INT NOT NULL,
amount FLOAT NOT NULL,
dt DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    product
SELECT  id, CONCAT('Product ', id)
FROM    filler;

INSERT
INTO    sale (id, product_id, amount, dt)
SELECT  id,
        FLOOR(RAND(20100305) * 500000) + 1,
        RAND(20100305 &lt;&lt; 1) * 100 + 1,
        '2010-03-06' - INTERVAL id MINUTE
FROM    filler;

CREATE INDEX ix_sale_product_dt ON sale (product_id, dt);
CREATE INDEX ix_sale_dt_product ON sale (dt, product_id);

The table contains 500,000 products and 500,000 random sales (1,440 sales per day).

Now, let's run the query similar to the author's one. I adjusted the period so that fewer than 10 actual sales were made during the period and LEFT JOIN records can be seen in the table:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
sale s
ON      s.product_id = p.id
AND s.dt &gt;= '2010-01-01'
AND s.dt &lt; '2010-01-01 00:07:00'
GROUP BY
        p.id
ORDER BY
        sumsell DESC, p.id
LIMIT 10

View query details

id	name	sumsell
50630	Product 50630	1
143395	Product 143395	1
222114	Product 222114	1
322966	Product 322966	1
340133	Product 340133	1
345789	Product 345789	1
462937	Product 462937	1
1	Product 1	0
2	Product 2	0
3	Product 3	0
10 rows fetched in 0.0009s (3.0312s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	p	index		PRIMARY	4		10	5007270.00	Using temporary; Using filesort
1	SIMPLE	s	ref	ix_sale_product_dt,ix_sale_dt_product	ix_sale_product_dt	4	20100305_left.p.id	1	100.00	Using index

select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,count(`20100305_left`.`s`.`id`) AS `sumsell` from `20100305_left`.`product` `p` left join `20100305_left`.`sale` `s` on(((`20100305_left`.`s`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`s`.`dt` >= '2010-01-01') and (`20100305_left`.`s`.`dt` < '2010-01-01 00:07:00'))) where 1 group by `20100305_left`.`p`.`id` order by count(`20100305_left`.`s`.`id`) desc,`20100305_left`.`p`.`id` limit 10

The query runs for 3 seconds.

We see that, first, product is made leading, and, second, only a part of the index on sale (product, dt) is used: each sale is only filtered on product, not on date.

Since there were only 7 sales during the period we have chosen, it would be a wise decision to make sale leading in the join so that it could be filtered on date and the resulting recordset was then joined to product. This would result in at most 7 PRIMARY KEY seeks instead of 500,000 range scans and would be much more efficient.

However, this is only possible with the INNER JOIN, and if there are less then 10 products sold within the time period, we will not see the rest.

To work around this, we need to emulate the LEFT JOIN:

Find the products sold within the time period, using an INNER JOIN of product with the resultset containg aggregated sales.
Find the products not sold within the time period, using NOT EXISTS predicate.
Concatenate the two resultsets using UNION ALL.

The step 2 implies that product is leading again, so normally it would not be much of improvement. But in our case, we don't need the whole recordset, we only need the top 10 sales.

So we can just order and limit the recordsets retrieved on steps 1 and 2 to ten records each, concatenate them, then order and limit the resulting recordset again to ten records.

The second resultset will contain a hardcoded 0 in the sumsell, so we just need to order it on product.id. Since product is an InnoDB table and product.id is a clustered PRIMARY KEY, this is not a problem.

Here's the query:

SELECT  p.*, sumsell
FROM    (
SELECT  *
FROM    (
SELECT  product_id, sumsell
FROM    (
SELECT  product_id, COUNT(*) AS sumsell
FROM    sale si
WHERE   dt &gt;= '2010-01-01'
AND dt &lt; '2010-01-01 00:07:00'
                        GROUP BY
                                product_id        
                        ) si
                ORDER BY
                        sumsell DESC, product_id
                LIMIT 10
                ) q1
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  p.id, 0
                FROM    product p
                WHERE   NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    sale si
                        WHERE   product_id = p.id
                                AND dt &gt;= '2010-01-01'
AND dt &lt; '2010-01-01 00:07:00'
                        )
                ORDER BY
                        p.id
                LIMIT 10
                ) q2
        ORDER BY
                sumsell DESC, product_id
        LIMIT 10
        ) q
JOIN    product p
ON      p.id = q.product_id

View query details

id	name	sumsell
50630	Product 50630	1
143395	Product 143395	1
222114	Product 222114	1
322966	Product 322966	1
340133	Product 340133	1
345789	Product 345789	1
462937	Product 462937	1
1	Product 1	0
2	Product 2	0
3	Product 3	0
10 rows fetched in 0.0012s (0.0064s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					10	100.00
1	PRIMARY	p	eq_ref	PRIMARY	PRIMARY	4	q.product_id	1	100.00
2	DERIVED	<derived3>	ALL					7	100.00
3	DERIVED	<derived4>	ALL					7	100.00	Using filesort
4	DERIVED	si	range	ix_sale_dt_product	ix_sale_dt_product	8		6	100.00	Using where; Using index; Using temporary; Using filesort
5	UNION	<derived6>	ALL					10	100.00
6	DERIVED	p	index		PRIMARY	4		10	5007270.00	Using where; Using index
7	DEPENDENT SUBQUERY	si	ref	ix_sale_product_dt,ix_sale_dt_product	ix_sale_product_dt	4	20100305_left.p.id	1	100.00	Using where; Using index
	UNION RESULT	<union2,5>	ALL							Using filesort

Field or reference '20100305_left.p.id' of SELECT #7 was resolved in SELECT #6
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,`q`.`sumsell` AS `sumsell` from (select `q1`.`product_id` AS `product_id`,`q1`.`sumsell` AS `sumsell` from (select `si`.`product_id` AS `product_id`,`si`.`sumsell` AS `sumsell` from (select `20100305_left`.`si`.`product_id` AS `product_id`,count(0) AS `sumsell` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`dt` >= '2010-01-01') and (`20100305_left`.`si`.`dt` < '2010-01-01 00:07:00')) group by `20100305_left`.`si`.`product_id`) `si` order by `si`.`sumsell` desc,`si`.`product_id` limit 10) `q1` union all select `q2`.`id` AS `id`,`q2`.`0` AS `0` from (select `20100305_left`.`p`.`id` AS `id`,0 AS `0` from `20100305_left`.`product` `p` where (not(exists(select NULL AS `NULL` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`si`.`dt` >= '2010-01-01') and (`20100305_left`.`si`.`dt` < '2010-01-01 00:07:00'))))) order by `20100305_left`.`p`.`id` limit 10) `q2` order by `sumsell` desc,`product_id` limit 10) `q` join `20100305_left`.`product` `p` where (`20100305_left`.`p`.`id` = `q`.`product_id`)

This query completes in less than 7 ms (which is comparable to the time measurement error).

Written by Quassnoi

March 5th, 2010 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED

Aggregates and LEFT JOIN

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow