EXPLAIN EXTENDED

How to create fast database queries

Aggregates and LEFT JOIN

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table product with products and table sale with all sale operations that were done on these products.

I would like to get 10 most often sold products today and what I did is this:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt >= '2010-01-01'
        AND s.dt < '2010-01-02'
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10

, but performance of it is very slow.

What can I do to increase performance of this particular query?

The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.

Let's create the sample tables:

Table creation details

The table contains 500,000 products and 500,000 random sales (1,440 sales per day).

Now, let's run the query similar to the author's one. I adjusted the period so that fewer than 10 actual sales were made during the period and LEFT JOIN records can be seen in the table:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt >= '2010-01-01'
        AND s.dt < '2010-01-01 00:07:00'
GROUP BY
        p.id
ORDER BY
        sumsell DESC, p.id
LIMIT 10

View query details

The query runs for 3 seconds.

We see that, first, product is made leading, and, second, only a part of the index on sale (product, dt) is used: each sale is only filtered on product, not on date.

Since there were only 7 sales during the period we have chosen, it would be a wise decision to make sale leading in the join so that it could be filtered on date and the resulting recordset was then joined to product. This would result in at most 7 PRIMARY KEY seeks instead of 500,000 range scans and would be much more efficient.

However, this is only possible with the INNER JOIN, and if there are less then 10 products sold within the time period, we will not see the rest.

To work around this, we need to emulate the LEFT JOIN:

  1. Find the products sold within the time period, using an INNER JOIN of product with the resultset containg aggregated sales.

  2. Find the products not sold within the time period, using NOT EXISTS predicate.

  3. Concatenate the two resultsets using UNION ALL.

The step 2 implies that product is leading again, so normally it would not be much of improvement. But in our case, we don't need the whole recordset, we only need the top 10 sales.

So we can just order and limit the recordsets retrieved on steps 1 and 2 to ten records each, concatenate them, then order and limit the resulting recordset again to ten records.

The second resultset will contain a hardcoded 0 in the sumsell, so we just need to order it on product.id. Since product is an InnoDB table and product.id is a clustered PRIMARY KEY, this is not a problem.

Here's the query:

SELECT  p.*, sumsell
FROM    (
        SELECT  *
        FROM    (
                SELECT  product_id, sumsell
                FROM    (
                        SELECT  product_id, COUNT(*) AS sumsell
                        FROM    sale si
                        WHERE   dt >= '2010-01-01'
                                AND dt < '2010-01-01 00:07:00'
                        GROUP BY
                                product_id        
                        ) si
                ORDER BY
                        sumsell DESC, product_id
                LIMIT 10
                ) q1
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  p.id, 0
                FROM    product p
                WHERE   NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    sale si
                        WHERE   product_id = p.id
                                AND dt >= '2010-01-01'
                                AND dt < '2010-01-01 00:07:00'
                        )
                ORDER BY
                        p.id
                LIMIT 10
                ) q2
        ORDER BY
                sumsell DESC, product_id
        LIMIT 10
        ) q
JOIN    product p
ON      p.id = q.product_id

View query details

This query completes in less than 7 ms (which is comparable to the time measurement error).

Written by Quassnoi

March 5th, 2010 at 11:00 pm

Posted in MySQL

Leave a Reply