MySQL: matching whole sets at EXPLAIN EXTENDED

I have a database that stores products available on the market and products still in development in two separate tables (market_product and dev_product).

A third table (substance) contains all substances a product can made of.

Other two tables (marked_product_comp and dev_product_comp) mantains product compositions.

I want to select products still in development that are made of the same ingredients of marketed products.

This means that we need to return every dev_product that has at least one market_product with complete substance match.

In the following example:

DevProductID	Substance
1	Apple
1	Banana
2	Apple
2	Banana
2	Cabbage

MarketProductID	Substance
1	Apple
2	Apple
2	Banana
3	Apple
3	Banana
3	Cabbage
3	Daikon

we should return development product 1, since it's made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.

We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.

In one of the previous articles I wrote about a similar task in SQL Server:

Matching whole sets

However, this used FULL OUTER JOIN which MySQL lacks.

Let's see how to implement this task in MySQL:

Table creation scripts

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE dev_product (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE market_product (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE dev_product_comp (
        prodID INT NOT NULL,
        substID INT NOT NULL,
        PRIMARY KEY (prodID, substID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE market_product_comp (
        prodID INT NOT NULL,
        substID INT NOT NULL,
        PRIMARY KEY (prodID, substID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500);
COMMIT;

INSERT
INTO    dev_product
SELECT  id, CONCAT('Dev product ', id)
FROM    filler;

INSERT
INTO    market_product
SELECT  id, CONCAT('Market product ', id)
FROM    filler;

INSERT
INTO    dev_product_comp
SELECT  *
FROM    (
        SELECT  d.id AS prodID, s.id AS substID
        FROM    dev_product d
        CROSS JOIN
                (
                SELECT  id
                FROM    filler
                ORDER BY
                        id
                LIMIT 15
                ) s
        ) q
WHERE   RAND(20090908) < 0.5;

INSERT
INTO    market_product_comp
SELECT  *
FROM    (
        SELECT  m.id AS prodID, s.id AS substID
        FROM    market_product m
        CROSS JOIN
                (
                SELECT  id
                FROM    filler
                ORDER BY
                        id
                LIMIT 15
                ) s
        ) q
WHERE   RAND(20090908 << 1) < 0.5;
&#91;/sourcecode&#93;
</div>

There are <strong>500</strong> products both on market and in development, each having a random number of components (<strong>7.5</strong> in average).

Since <strong>MySQL</strong> lacks <code>FULL OUTER JOIN</code>, we need to mimic it.

We could use <code>LEFT JOIN</code>'s in a <code>UNION</code>, however, <strong>MySQL</strong> has a nasty <a href="http://bugs.mysql.com/bug.php?id=35455">bug</a> (introduced somewhere on its way to version <strong>5</strong>): it does not allow using outer query expression in <code>ON</code> clauses of correlated subqueries.

To work around this, we could use a self join inside each correlated subquery which would allow us to use <strong>WHERE</strong> clause instead. However, even a join on <code>id</code> will degrade the query performance.

That's why we'll have to rewrite this query as a bunch of nested <code>EXISTS</code> statements. This will be less readable, however, more efficient.

We need to find all development products for which exists a market product that shares the component set. This means that for the given development/market pair there is neither a development product component not found in a market product nor a market component not found in development product.

Here's this query in <strong>SQL</strong>:


SELECT  *
FROM    dev_product dp
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    market_product mp
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    dev_product_comp dpc
                WHERE   dpc.prodID = dp.id
                        AND NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    market_product_comp mpc
                        WHERE   mpc.prodID = mp.id
                                AND mpc.substID = dpc.substID
                        )
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    market_product_comp mpc
                WHERE   mpc.prodID = mp.id
                        AND NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    dev_product_comp dpc
                        WHERE   dpc.prodID = dp.id
                                AND dpc.substID = mpc.substID
                        )
                )

        )

And here are the results:

Show results and execution plan

id	name
30	Dev product 30
74	Dev product 74
136	Dev product 136
139	Dev product 139
282	Dev product 282
411	Dev product 411
6 rows fetched in 0.0002s (5.0779s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	dp	ALL					442	100.00	Using where
2	DEPENDENT SUBQUERY	mp	index		PRIMARY	4		371	100.00	Using where; Using index
5	DEPENDENT SUBQUERY	mpc	ref	PRIMARY	PRIMARY	4	20090908_sets.mp.id	3	100.00	Using where; Using index
6	DEPENDENT SUBQUERY	dpc	eq_ref	PRIMARY	PRIMARY	8	20090908_sets.dp.id,20090908_sets.mpc.substID	1	100.00	Using index
3	DEPENDENT SUBQUERY	dpc	ref	PRIMARY	PRIMARY	4	20090908_sets.dp.id	3	100.00	Using where; Using index
4	DEPENDENT SUBQUERY	mpc	eq_ref	PRIMARY	PRIMARY	8	20090908_sets.mp.id,20090908_sets.dpc.substID	1	100.00	Using index

This query does almost what it says: it takes each dev/market pair and checks the missing components in both ways.

EXISTS conditions return as soon as anything is found, that's why in fact not every pair is checked.

However, misses are very probable (the query returns only 6 products of 500 possible), so this query actually shows the worst case.

Nevertheless, it completes in only 5 seconds which is quite fast for 250,000 pairs.

In case of more probable hits, the query will complete even sooner.

Written by Quassnoi

September 8th, 2009 at 11:00 pm

Posted in MySQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

EXPLAIN EXTENDED