MySQL: matching whole sets
From Stack Overflow:
I have a database that stores products
available on the marketand productsstill in developmentin two separate tables (market_productanddev_product).A third table (
substance) contains all substances a product can made of.Other two tables (
marked_product_companddev_product_comp) mantains product compositions.I want to select products still in development that are made of the same ingredients of marketed products.
This means that we need to return every dev_product that has at least one market_product with complete substance match.
In the following example:
| DevProductID | Substance |
|---|---|
| 1 | Apple |
| 1 | Banana |
| 2 | Apple |
| 2 | Banana |
| 2 | Cabbage |
| MarketProductID | Substance |
|---|---|
| 1 | Apple |
| 2 | Apple |
| 2 | Banana |
| 3 | Apple |
| 3 | Banana |
| 3 | Cabbage |
| 3 | Daikon |
we should return development product 1, since it’s made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.
We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.
In one of the previous articles I wrote about a similar task in SQL Server:
However, this used FULL OUTER JOIN which MySQL lacks.
Let’s see how to implement this task in MySQL:
There are 500 products both on market and in development, each having a random number of components (7.5 in average).
Since MySQL lacks FULL OUTER JOIN, we need to mimic it.
We could use LEFT JOIN‘s in a UNION, however, MySQL has a nasty bug (introduced somewhere on its way to version 5): it does not allow using outer query expression in ON clauses of correlated subqueries.
To work around this, we could use a self join inside each correlated subquery which would allow us to use WHERE clause instead. However, even a join on id will degrade the query performance.
That’s why we’ll have to rewrite this query as a bunch of nested EXISTS statements. This will be less readable, however, more efficient.
We need to find all development products for which exists a market product that shares the component set. This means that for the given development/market pair there is neither a development product component not found in a market product nor a market component not found in development product.
Here’s this query in SQL:
SELECT *
FROM dev_product dp
WHERE EXISTS
(
SELECT NULL
FROM market_product mp
WHERE NOT EXISTS
(
SELECT NULL
FROM dev_product_comp dpc
WHERE dpc.prodID = dp.id
AND NOT EXISTS
(
SELECT NULL
FROM market_product_comp mpc
WHERE mpc.prodID = mp.id
AND mpc.substID = dpc.substID
)
)
AND NOT EXISTS
(
SELECT NULL
FROM market_product_comp mpc
WHERE mpc.prodID = mp.id
AND NOT EXISTS
(
SELECT NULL
FROM dev_product_comp dpc
WHERE dpc.prodID = dp.id
AND dpc.substID = mpc.substID
)
)
)
And here are the results:
Show results and execution plan
This query does almost what it says: it takes each dev/market pair and checks the missing components in both ways.
EXISTS conditions return as soon as anything is found, that’s why in fact not every pair is checked.
However, misses are very probable (the query returns only 6 products of 500 possible), so this query actually shows the worst case.
Nevertheless, it completes in only 5 seconds which is quite fast for 250,000 pairs.
In case of more probable hits, the query will complete even sooner.
Subscribe in a reader