Matching sets: aggregates vs. first miss at EXPLAIN EXTENDED

Here is my schema:

Suppliers
sid sname address

Parts
pid pname color

Catalog
sid pid cost

I need to find the sids of suppliers who supply every red part or every green part.

This task requires matching the sets.

We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.

Unlike other engines, MySQL does not provide the set operators like EXCEPT or MINUS that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.

There are two ways to check that:

First miss technique: test each record from the subset candidate against the superset candidate, returning FALSE if there is no match.
Aggregate technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match

Let's test which way is faster in which cases. To do this, we will need some sample tables:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE suppliers (
        sid INT NOT NULL PRIMARY KEY,
        sname VARCHAR(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE parts (
        pid INT NOT NULL PRIMARY KEY,
        pname VARCHAR(100) NOT NULL,
        color VARCHAR(100) NOT NULL,
        KEY ix_parts_color (color)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE catalog (
        sid INT NOT NULL,
        pid INT NOT NULL,
        cost REAL NOT NULL,
        PRIMARY KEY pk_catalog_sp(sid, pid),
        KEY ix_catalog_pid (pid)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(10000);
COMMIT;

INSERT
INTO    suppliers
SELECT  id, CONCAT('Supplier ', id)
FROM    filler
ORDER BY
        id
LIMIT 10000;

INSERT
INTO    parts
SELECT  id, CONCAT('Part ', id), CASE WHEN id <= 20 THEN ELT(id % 2 + 1, 'red', 'green') ELSE 'blue' END
FROM    filler
ORDER BY
        id
LIMIT 2000;

INSERT
INTO    catalog
SELECT  sid, pid, 100
FROM    (
        SELECT  sid, pid, RAND(20100225) AS rnd
        FROM    suppliers
        JOIN    parts
        ON      color IN ('red', 'green')
        ) q
WHERE   rnd < 0.4;

INSERT
INTO    catalog
SELECT  sid, pid, 100
FROM    (
        SELECT  sid, pid, RAND(20100225 << 1) AS rnd
        FROM    (
                SELECT  sid
                FROM    suppliers
                ORDER BY
                        sid
                LIMIT 200
                ) q
        JOIN    parts
        ON      color = 'blue'
        ) q2
WHERE   rnd < 0.998;
&#91;/sourcecode&#93;
</div>

There are <strong>10,000</strong> suppliers and <strong>2,000</strong> parts.

The parts can be red, green or blue. As for red and green, there are <strong>10</strong> parts of each color, and they are distributed evenly across the suppliers. With blue, the situation is different: there are <strong>1980</strong> blue parts and only <strong>200</strong> first suppliers provide it. However, for each of the blue part suppliers, the probability of each blue part to be available is very high.

<h3>First miss</h3>

The first miss is a combination of <code>NOT IN</code> / <code>NOT EXISTS</code> clauses that immediately return <code>FALSE</code> whenever a single non-matching record is found. Since <strong>MySQL</strong> can only do nested loops, the performance of these queries is heavily dependent on proper indexing.

Let's run this query to search for red or green parts:


SELECT  *
FROM    suppliers s
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  'red' AS color
                UNION ALL
                SELECT  'green' AS color
                ) ci
        WHERE   color NOT IN
                (
                SELECT  color
                FROM    parts p
                WHERE   p.pid NOT IN
                        (
                        SELECT  pid
                        FROM    catalog c
                        WHERE   c.sid = s.sid
                        )
                )
        )

View query details

sid	sname
7442	Supplier 7442
1 row fetched in 0.0003s (2.1719s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	s	ALL					10342	100.00	Using where
2	DEPENDENT SUBQUERY	<derived3>	ALL					2	100.00	Using where
5	DEPENDENT SUBQUERY	p	ref	ix_parts_color	ix_parts_color	302	func	439	100.00	Using where; Using index
6	DEPENDENT SUBQUERY	c	eq_ref	PRIMARY,ix_catalog_pid	PRIMARY	8	20100225_sets.s.sid,func	1	100.00	Using index
3	DERIVED									No tables used
4	UNION									No tables used
	UNION RESULT	<union3,4>	ALL

Field or reference '20100225_sets.s.sid' of SELECT #6 was resolved in SELECT #1
select `20100225_sets`.`s`.`sid` AS `sid`,`20100225_sets`.`s`.`sname` AS `sname` from `20100225_sets`.`suppliers` `s` where exists(select NULL AS `NULL` from (select 'red' AS `color` union all select 'green' AS `color`) `ci` where (not(<in_optimizer>(`ci`.`color`,<exists>(select 1 AS `Not_used` from `20100225_sets`.`parts` `p` where ((not(<in_optimizer>(`20100225_sets`.`p`.`pid`,<exists>(select 1 AS `Not_used` from `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`c`.`sid` = `20100225_sets`.`s`.`sid`) and (<cache>(`20100225_sets`.`p`.`pid`) = `20100225_sets`.`c`.`pid`)))))) and (convert(<cache>(`ci`.`color`) using utf8) = `20100225_sets`.`p`.`color`)))))))

This query runs for 2.17 s.

The same query for the blue parts:

SELECT  *
FROM    suppliers s
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  'blue' AS color
                ) ci
        WHERE   color NOT IN
                (
                SELECT  color
                FROM    parts p
                WHERE   p.pid NOT IN
                        (
                        SELECT  pid
                        FROM    catalog c
                        WHERE   c.sid = s.sid
                        )
                )
        )

View query details

sid	sname
10	Supplier 10
96	Supplier 96
101	Supplier 101
3 rows fetched in 0.0005s (2.4375s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	s	ALL					10342	100.00	Using where
2	DEPENDENT SUBQUERY	<derived3>	system					1	100.00
4	DEPENDENT SUBQUERY	p	ref	ix_parts_color	ix_parts_color	302	func	439	100.00	Using where; Using index
5	DEPENDENT SUBQUERY	c	eq_ref	PRIMARY,ix_catalog_pid	PRIMARY	8	20100225_sets.s.sid,func	1	100.00	Using index
3	DERIVED									No tables used

Field or reference '20100225_sets.s.sid' of SELECT #5 was resolved in SELECT #1
select `20100225_sets`.`s`.`sid` AS `sid`,`20100225_sets`.`s`.`sname` AS `sname` from `20100225_sets`.`suppliers` `s` where exists(select NULL AS `NULL` from (select 'blue' AS `color`) `ci` where (not(<in_optimizer>('blue',<exists>(select 1 AS `Not_used` from `20100225_sets`.`parts` `p` where ((not(<in_optimizer>(`20100225_sets`.`p`.`pid`,<exists>(select 1 AS `Not_used` from `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`c`.`sid` = `20100225_sets`.`s`.`sid`) and (<cache>(`20100225_sets`.`p`.`pid`) = `20100225_sets`.`c`.`pid`)))))) and (convert(<cache>('blue') using utf8) = `20100225_sets`.`p`.`color`)))))))

This time the query is a little bit slower: 2.43 s.

Aggregate

In the aggregate solution, we first calculate the total number of parts for each color and the count the parts of this color supplied by each supplier. If the counts match, the sets match too.

Let's try it on red and green parts first:

SELECT  c.sid, p.color
FROM    (
        SELECT  color, COUNT(*) AS total
        FROM    parts
        WHERE   color IN ('red', 'green')
        GROUP BY
                color
        ) t
JOIN    parts p
ON      p.color = t.color
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        sid, color, total
HAVING  COUNT(*) = total

View query details

sid	color
7442	green
1 row fetched in 0.0003s (1.1406s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					2	100.00	Using temporary; Using filesort
1	PRIMARY	p	ref	PRIMARY,ix_parts_color	ix_parts_color	302	t.color	439	100.00	Using index
1	PRIMARY	c	ref	ix_catalog_pid	ix_catalog_pid	4	20100225_sets.p.pid	109	100.00	Using index
2	DERIVED	parts	range	ix_parts_color	ix_parts_color	302		19	100.00	Using where; Using index

select `20100225_sets`.`c`.`sid` AS `sid`,`20100225_sets`.`p`.`color` AS `color` from (select `20100225_sets`.`parts`.`color` AS `color`,count(0) AS `total` from `20100225_sets`.`parts` where (`20100225_sets`.`parts`.`color` in ('red','green')) group by `20100225_sets`.`parts`.`color`) `t` join `20100225_sets`.`parts` `p` join `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`p`.`color` = `t`.`color`) and (`20100225_sets`.`c`.`pid` = `20100225_sets`.`p`.`pid`)) group by `20100225_sets`.`c`.`sid`,`20100225_sets`.`p`.`color`,`t`.`total` having (count(0) = `t`.`total`)

And the same query on the blue parts:

SELECT  c.sid, p.color
FROM    (
        SELECT  color, COUNT(*) AS total
        FROM    parts
        WHERE   color IN ('blue')
        GROUP BY
                color
        ) t
JOIN    parts p
ON      p.color = t.color
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        sid, color, total
HAVING  COUNT(*) = total

View query details

sid	color
10	blue
96	blue
101	blue
3 rows fetched in 0.0004s (3.0906s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	system					1	100.00	Using temporary; Using filesort
1	PRIMARY	p	ref	PRIMARY,ix_parts_color	ix_parts_color	302	const	1319	100.00	Using index
1	PRIMARY	c	ref	ix_catalog_pid	ix_catalog_pid	4	20100225_sets.p.pid	109	100.00	Using index
2	DERIVED	parts	ref	ix_parts_color	ix_parts_color	302		1319	100.00	Using where; Using index

select `20100225_sets`.`c`.`sid` AS `sid`,`20100225_sets`.`p`.`color` AS `color` from (select `20100225_sets`.`parts`.`color` AS `color`,count(0) AS `total` from `20100225_sets`.`parts` where (`20100225_sets`.`parts`.`color` = 'blue') group by `20100225_sets`.`parts`.`color`) `t` join `20100225_sets`.`parts` `p` join `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`p`.`color` = 'blue') and (`20100225_sets`.`c`.`pid` = `20100225_sets`.`p`.`pid`)) group by `20100225_sets`.`c`.`sid` having (count(0) = '1980')

We see that the queries complete in 1.14 seconds and 3.09 seconds, accordingly.

The aggregate method is more efficient for the red and green parts, while the first miss is more efficient for the blue parts.

Analysis

What is the reason of such a difference in performance?

The first miss method generally needs to parse less records (only those before the first miss), but each record is searched for in a nested loop, starting from the index root. The aggregate method needs to parse all records to calculate the COUNT(*), but these records are fetched in a sequential index access.

The red and green parts have the large number of small sets, with at most 10 records in each. The probability of the miss is relatively small: a large number of records should be parsed before any record is missed in a set. The aggregates, on the other hand, can be calculated very fast, since there are relatively few records to aggregate.

With the blue parts, the situation is different. There are few large sets, and calculating the aggregates requires fetching, sorting and grouping of too many records. First misses, on the other hand, occurs almost instantly: the vast majority of the suppliers do not offer any blue parts at all.

Summary

As it often happens, the performance of the two methods to compare the sets depends on the data distribution.

The sets with less records and lower probability of the record miss will benefit from the aggregate method, since the performance increase caused by the sequential access to the records overweights the need to parse a larger number of the records.

The sets with more records and higher probability of a miss will return the misses very soon, so the first miss method is more beneficial for them.

Written by Quassnoi

February 25th, 2010 at 11:00 pm

Posted in MySQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28

EXPLAIN EXTENDED