EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

MySQL: matching whole sets

From Stack Overflow:

I have a database that stores products available on the market and products still in development in two separate tables (market_product and dev_product).

A third table (substance) contains all substances a product can made of.

Other two tables (marked_product_comp and dev_product_comp) mantains product compositions.

I want to select products still in development that are made of the same ingredients of marketed products.

This means that we need to return every dev_product that has at least one market_product with complete substance match.

In the following example:

DevProductID Substance
1 Apple
1 Banana
2 Apple
2 Banana
2 Cabbage
MarketProductID Substance
1 Apple
2 Apple
2 Banana
3 Apple
3 Banana
3 Cabbage
3 Daikon

we should return development product 1, since it’s made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.

We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.

In one of the previous articles I wrote about a similar task in SQL Server:

However, this used FULL OUTER JOIN which MySQL lacks.

Let’s see how to implement this task in MySQL:


Table creation scripts

There are 500 products both on market and in development, each having a random number of components (7.5 in average).

Since MySQL lacks FULL OUTER JOIN, we need to mimic it.

We could use LEFT JOIN‘s in a UNION, however, MySQL has a nasty bug (introduced somewhere on its way to version 5): it does not allow using outer query expression in ON clauses of correlated subqueries.

To work around this, we could use a self join inside each correlated subquery which would allow us to use WHERE clause instead. However, even a join on id will degrade the query performance.

That’s why we’ll have to rewrite this query as a bunch of nested EXISTS statements. This will be less readable, however, more efficient.

We need to find all development products for which exists a market product that shares the component set. This means that for the given development/market pair there is neither a development product component not found in a market product nor a market component not found in development product.

Here’s this query in SQL:

SELECT  *
FROM    dev_product dp
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    market_product mp
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    dev_product_comp dpc
                WHERE   dpc.prodID = dp.id
                        AND NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    market_product_comp mpc
                        WHERE   mpc.prodID = mp.id
                                AND mpc.substID = dpc.substID
                        )
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    market_product_comp mpc
                WHERE   mpc.prodID = mp.id
                        AND NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    dev_product_comp dpc
                        WHERE   dpc.prodID = dp.id
                                AND dpc.substID = mpc.substID
                        )
                )

        )

And here are the results:

Show results and execution plan

This query does almost what it says: it takes each dev/market pair and checks the missing components in both ways.

EXISTS conditions return as soon as anything is found, that’s why in fact not every pair is checked.

However, misses are very probable (the query returns only 6 products of 500 possible), so this query actually shows the worst case.

Nevertheless, it completes in only 5 seconds which is quite fast for 250,000 pairs.

In case of more probable hits, the query will complete even sooner.

Written by Quassnoi

September 8th, 2009 at 11:00 pm

Posted in MySQL

Comments are closed.