From Stack Overflow:
I have a database that stores products available on the market
and products still in development
in two separate tables (market_product
and dev_product
).
A third table (substance
) contains all substances a product can made of.
Other two tables (marked_product_comp
and dev_product_comp
) mantains product compositions.
I want to select products still in development that are made of the same ingredients of marketed products.
This means that we need to return every dev_product
that has at least one market_product
with complete substance match.
In the following example:
DevProductID |
Substance |
1 |
Apple |
1 |
Banana |
2 |
Apple |
2 |
Banana |
2 |
Cabbage |
MarketProductID |
Substance |
1 |
Apple |
2 |
Apple |
2 |
Banana |
3 |
Apple |
3 |
Banana |
3 |
Cabbage |
3 |
Daikon |
we should return development product 1, since it's made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.
We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.
In one of the previous articles I wrote about a similar task in SQL Server:
However, this used FULL OUTER JOIN
which MySQL lacks.
Let's see how to implement this task in MySQL:
Table creation scripts
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE dev_product (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE market_product (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE dev_product_comp (
prodID INT NOT NULL,
substID INT NOT NULL,
PRIMARY KEY (prodID, substID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE market_product_comp (
prodID INT NOT NULL,
substID INT NOT NULL,
PRIMARY KEY (prodID, substID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(500);
COMMIT;
INSERT
INTO dev_product
SELECT id, CONCAT('Dev product ', id)
FROM filler;
INSERT
INTO market_product
SELECT id, CONCAT('Market product ', id)
FROM filler;
INSERT
INTO dev_product_comp
SELECT *
FROM (
SELECT d.id AS prodID, s.id AS substID
FROM dev_product d
CROSS JOIN
(
SELECT id
FROM filler
ORDER BY
id
LIMIT 15
) s
) q
WHERE RAND(20090908) < 0.5;
INSERT
INTO market_product_comp
SELECT *
FROM (
SELECT m.id AS prodID, s.id AS substID
FROM market_product m
CROSS JOIN
(
SELECT id
FROM filler
ORDER BY
id
LIMIT 15
) s
) q
WHERE RAND(20090908 << 1) < 0.5;
[/sourcecode]
</div>
There are <strong>500</strong> products both on market and in development, each having a random number of components (<strong>7.5</strong> in average).
Since <strong>MySQL</strong> lacks <code>FULL OUTER JOIN</code>, we need to mimic it.
We could use <code>LEFT JOIN</code>'s in a <code>UNION</code>, however, <strong>MySQL</strong> has a nasty <a href="http://bugs.mysql.com/bug.php?id=35455">bug</a> (introduced somewhere on its way to version <strong>5</strong>): it does not allow using outer query expression in <code>ON</code> clauses of correlated subqueries.
To work around this, we could use a self join inside each correlated subquery which would allow us to use <strong>WHERE</strong> clause instead. However, even a join on <code>id</code> will degrade the query performance.
That's why we'll have to rewrite this query as a bunch of nested <code>EXISTS</code> statements. This will be less readable, however, more efficient.
We need to find all development products for which exists a market product that shares the component set. This means that for the given development/market pair there is neither a development product component not found in a market product nor a market component not found in development product.
Here's this query in <strong>SQL</strong>:
SELECT *
FROM dev_product dp
WHERE EXISTS
(
SELECT NULL
FROM market_product mp
WHERE NOT EXISTS
(
SELECT NULL
FROM dev_product_comp dpc
WHERE dpc.prodID = dp.id
AND NOT EXISTS
(
SELECT NULL
FROM market_product_comp mpc
WHERE mpc.prodID = mp.id
AND mpc.substID = dpc.substID
)
)
AND NOT EXISTS
(
SELECT NULL
FROM market_product_comp mpc
WHERE mpc.prodID = mp.id
AND NOT EXISTS
(
SELECT NULL
FROM dev_product_comp dpc
WHERE dpc.prodID = dp.id
AND dpc.substID = mpc.substID
)
)
)
And here are the results:
Show results and execution plan
id |
name |
30 |
Dev product 30 |
74 |
Dev product 74 |
136 |
Dev product 136 |
139 |
Dev product 139 |
282 |
Dev product 282 |
411 |
Dev product 411 |
6 rows fetched in 0.0002s (5.0779s) |
id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
dp |
ALL |
|
|
|
|
442 |
100.00 |
Using where |
2 |
DEPENDENT SUBQUERY |
mp |
index |
|
PRIMARY |
4 |
|
371 |
100.00 |
Using where; Using index |
5 |
DEPENDENT SUBQUERY |
mpc |
ref |
PRIMARY |
PRIMARY |
4 |
20090908_sets.mp.id |
3 |
100.00 |
Using where; Using index |
6 |
DEPENDENT SUBQUERY |
dpc |
eq_ref |
PRIMARY |
PRIMARY |
8 |
20090908_sets.dp.id,20090908_sets.mpc.substID |
1 |
100.00 |
Using index |
3 |
DEPENDENT SUBQUERY |
dpc |
ref |
PRIMARY |
PRIMARY |
4 |
20090908_sets.dp.id |
3 |
100.00 |
Using where; Using index |
4 |
DEPENDENT SUBQUERY |
mpc |
eq_ref |
PRIMARY |
PRIMARY |
8 |
20090908_sets.mp.id,20090908_sets.dpc.substID |
1 |
100.00 |
Using index |
This query does almost what it says: it takes each dev/market pair and checks the missing components in both ways.
EXISTS
conditions return as soon as anything is found, that's why in fact not every pair is checked.
However, misses are very probable (the query returns only 6 products of 500 possible), so this query actually shows the worst case.
Nevertheless, it completes in only 5 seconds which is quite fast for 250,000 pairs.
In case of more probable hits, the query will complete even sooner.
Leave a Reply