Aggregates and LEFT JOIN
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table
productwith products and tablesalewith all sale operations that were done on these products.I would like to get 10 most often sold products today and what I did is this:
SELECT p.*, COUNT(s.id) AS sumsell FROM product p LEFT JOIN sale s ON s.product_id = p.id AND s.dt >= '2010-01-01' AND s.dt < '2010-01-02' GROUP BY p.id ORDER BY sumsell DESC LIMIT 10, but performance of it is very slow.
What can I do to increase performance of this particular query?
The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.
Let's create the sample tables:
Table creation details
The table contains 500,000 products and 500,000 random sales (1,440 sales per day).
Now, let's run the query similar to the author's one. I adjusted the period so that fewer than 10 actual sales were made during the period and LEFT JOIN records can be seen in the table:
SELECT p.*, COUNT(s.id) AS sumsell
FROM product p
LEFT JOIN
sale s
ON s.product_id = p.id
AND s.dt >= '2010-01-01'
AND s.dt < '2010-01-01 00:07:00'
GROUP BY
p.id
ORDER BY
sumsell DESC, p.id
LIMIT 10
The query runs for 3 seconds.
We see that, first, product is made leading, and, second, only a part of the index on sale (product, dt) is used: each sale is only filtered on product, not on date.
Since there were only 7 sales during the period we have chosen, it would be a wise decision to make sale leading in the join so that it could be filtered on date and the resulting recordset was then joined to product. This would result in at most 7 PRIMARY KEY seeks instead of 500,000 range scans and would be much more efficient.
However, this is only possible with the INNER JOIN, and if there are less then 10 products sold within the time period, we will not see the rest.
To work around this, we need to emulate the LEFT JOIN:
- Find the products sold within the time period, using an
INNER JOINofproductwith the resultset containg aggregated sales. - Find the products not sold within the time period, using
NOT EXISTSpredicate. - Concatenate the two resultsets using
UNION ALL.
The step 2 implies that product is leading again, so normally it would not be much of improvement. But in our case, we don't need the whole recordset, we only need the top 10 sales.
So we can just order and limit the recordsets retrieved on steps 1 and 2 to ten records each, concatenate them, then order and limit the resulting recordset again to ten records.
The second resultset will contain a hardcoded 0 in the sumsell, so we just need to order it on product.id. Since product is an InnoDB table and product.id is a clustered PRIMARY KEY, this is not a problem.
Here's the query:
SELECT p.*, sumsell
FROM (
SELECT *
FROM (
SELECT product_id, sumsell
FROM (
SELECT product_id, COUNT(*) AS sumsell
FROM sale si
WHERE dt >= '2010-01-01'
AND dt < '2010-01-01 00:07:00'
GROUP BY
product_id
) si
ORDER BY
sumsell DESC, product_id
LIMIT 10
) q1
UNION ALL
SELECT *
FROM (
SELECT p.id, 0
FROM product p
WHERE NOT EXISTS
(
SELECT NULL
FROM sale si
WHERE product_id = p.id
AND dt >= '2010-01-01'
AND dt < '2010-01-01 00:07:00'
)
ORDER BY
p.id
LIMIT 10
) q2
ORDER BY
sumsell DESC, product_id
LIMIT 10
) q
JOIN product p
ON p.id = q.product_id
This query completes in less than 7 ms (which is comparable to the time measurement error).
Subscribe in a reader