Matching sets: aggregates vs. first miss
Comments enabled. I *really* need your comment
From Stack Overflow:
Here is my schema:
Suppliers sid sname address
Parts pid pname color
Catalog sid pid cost I need to find the sids of suppliers who supply every red part or every green part.
This task requires matching the sets.
We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.
Unlike other engines, MySQL does not provide the set operators like EXCEPT
or MINUS
that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.
There are two ways to check that:
First miss
technique: test each record from the subset candidate against the superset candidate, returningFALSE
if there is no match.Aggregate
technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match
Let's test which way is faster in which cases. To do this, we will need some sample tables:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE suppliers ( sid INT NOT NULL PRIMARY KEY, sname VARCHAR(100) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; CREATE TABLE parts ( pid INT NOT NULL PRIMARY KEY, pname VARCHAR(100) NOT NULL, color VARCHAR(100) NOT NULL, KEY ix_parts_color (color) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; CREATE TABLE catalog ( sid INT NOT NULL, pid INT NOT NULL, cost REAL NOT NULL, PRIMARY KEY pk_catalog_sp(sid, pid), KEY ix_catalog_pid (pid) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(10000); COMMIT; INSERT INTO suppliers SELECT id, CONCAT('Supplier ', id) FROM filler ORDER BY id LIMIT 10000; INSERT INTO parts SELECT id, CONCAT('Part ', id), CASE WHEN id <= 20 THEN ELT(id % 2 + 1, 'red', 'green') ELSE 'blue' END FROM filler ORDER BY id LIMIT 2000; INSERT INTO catalog SELECT sid, pid, 100 FROM ( SELECT sid, pid, RAND(20100225) AS rnd FROM suppliers JOIN parts ON color IN ('red', 'green') ) q WHERE rnd < 0.4; INSERT INTO catalog SELECT sid, pid, 100 FROM ( SELECT sid, pid, RAND(20100225 << 1) AS rnd FROM ( SELECT sid FROM suppliers ORDER BY sid LIMIT 200 ) q JOIN parts ON color = 'blue' ) q2 WHERE rnd < 0.998; [/sourcecode] </div> There are <strong>10,000</strong> suppliers and <strong>2,000</strong> parts. The parts can be red, green or blue. As for red and green, there are <strong>10</strong> parts of each color, and they are distributed evenly across the suppliers. With blue, the situation is different: there are <strong>1980</strong> blue parts and only <strong>200</strong> first suppliers provide it. However, for each of the blue part suppliers, the probability of each blue part to be available is very high. <h3>First miss</h3> The first miss is a combination of <code>NOT IN</code> / <code>NOT EXISTS</code> clauses that immediately return <code>FALSE</code> whenever a single non-matching record is found. Since <strong>MySQL</strong> can only do nested loops, the performance of these queries is heavily dependent on proper indexing. Let's run this query to search for red or green parts: SELECT * FROM suppliers s WHERE EXISTS ( SELECT NULL FROM ( SELECT 'red' AS color UNION ALL SELECT 'green' AS color ) ci WHERE color NOT IN ( SELECT color FROM parts p WHERE p.pid NOT IN ( SELECT pid FROM catalog c WHERE c.sid = s.sid ) ) )
sid | sname | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7442 | Supplier 7442 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0003s (2.1719s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | s | ALL | 10342 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | <derived3> | ALL | 2 | 100.00 | Using where | ||||
5 | DEPENDENT SUBQUERY | p | ref | ix_parts_color | ix_parts_color | 302 | func | 439 | 100.00 | Using where; Using index |
6 | DEPENDENT SUBQUERY | c | eq_ref | PRIMARY,ix_catalog_pid | PRIMARY | 8 | 20100225_sets.s.sid,func | 1 | 100.00 | Using index |
3 | DERIVED | No tables used | ||||||||
4 | UNION | No tables used | ||||||||
UNION RESULT | <union3,4> | ALL |
Field or reference '20100225_sets.s.sid' of SELECT #6 was resolved in SELECT #1 select `20100225_sets`.`s`.`sid` AS `sid`,`20100225_sets`.`s`.`sname` AS `sname` from `20100225_sets`.`suppliers` `s` where exists(select NULL AS `NULL` from (select 'red' AS `color` union all select 'green' AS `color`) `ci` where (not(<in_optimizer>(`ci`.`color`,<exists>(select 1 AS `Not_used` from `20100225_sets`.`parts` `p` where ((not(<in_optimizer>(`20100225_sets`.`p`.`pid`,<exists>(select 1 AS `Not_used` from `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`c`.`sid` = `20100225_sets`.`s`.`sid`) and (<cache>(`20100225_sets`.`p`.`pid`) = `20100225_sets`.`c`.`pid`)))))) and (convert(<cache>(`ci`.`color`) using utf8) = `20100225_sets`.`p`.`color`)))))))
This query runs for 2.17 s.
The same query for the blue parts:
SELECT * FROM suppliers s WHERE EXISTS ( SELECT NULL FROM ( SELECT 'blue' AS color ) ci WHERE color NOT IN ( SELECT color FROM parts p WHERE p.pid NOT IN ( SELECT pid FROM catalog c WHERE c.sid = s.sid ) ) )
sid | sname | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | Supplier 10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
96 | Supplier 96 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
101 | Supplier 101 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0005s (2.4375s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | s | ALL | 10342 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | <derived3> | system | 1 | 100.00 | |||||
4 | DEPENDENT SUBQUERY | p | ref | ix_parts_color | ix_parts_color | 302 | func | 439 | 100.00 | Using where; Using index |
5 | DEPENDENT SUBQUERY | c | eq_ref | PRIMARY,ix_catalog_pid | PRIMARY | 8 | 20100225_sets.s.sid,func | 1 | 100.00 | Using index |
3 | DERIVED | No tables used |
Field or reference '20100225_sets.s.sid' of SELECT #5 was resolved in SELECT #1 select `20100225_sets`.`s`.`sid` AS `sid`,`20100225_sets`.`s`.`sname` AS `sname` from `20100225_sets`.`suppliers` `s` where exists(select NULL AS `NULL` from (select 'blue' AS `color`) `ci` where (not(<in_optimizer>('blue',<exists>(select 1 AS `Not_used` from `20100225_sets`.`parts` `p` where ((not(<in_optimizer>(`20100225_sets`.`p`.`pid`,<exists>(select 1 AS `Not_used` from `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`c`.`sid` = `20100225_sets`.`s`.`sid`) and (<cache>(`20100225_sets`.`p`.`pid`) = `20100225_sets`.`c`.`pid`)))))) and (convert(<cache>('blue') using utf8) = `20100225_sets`.`p`.`color`)))))))
This time the query is a little bit slower: 2.43 s.
Aggregate
In the aggregate solution, we first calculate the total number of parts for each color and the count the parts of this color supplied by each supplier. If the counts match, the sets match too.
Let's try it on red and green parts first:
SELECT c.sid, p.color FROM ( SELECT color, COUNT(*) AS total FROM parts WHERE color IN ('red', 'green') GROUP BY color ) t JOIN parts p ON p.color = t.color JOIN catalog c ON c.pid = p.pid GROUP BY sid, color, total HAVING COUNT(*) = total
sid | color | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7442 | green | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0003s (1.1406s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 2 | 100.00 | Using temporary; Using filesort | ||||
1 | PRIMARY | p | ref | PRIMARY,ix_parts_color | ix_parts_color | 302 | t.color | 439 | 100.00 | Using index |
1 | PRIMARY | c | ref | ix_catalog_pid | ix_catalog_pid | 4 | 20100225_sets.p.pid | 109 | 100.00 | Using index |
2 | DERIVED | parts | range | ix_parts_color | ix_parts_color | 302 | 19 | 100.00 | Using where; Using index |
select `20100225_sets`.`c`.`sid` AS `sid`,`20100225_sets`.`p`.`color` AS `color` from (select `20100225_sets`.`parts`.`color` AS `color`,count(0) AS `total` from `20100225_sets`.`parts` where (`20100225_sets`.`parts`.`color` in ('red','green')) group by `20100225_sets`.`parts`.`color`) `t` join `20100225_sets`.`parts` `p` join `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`p`.`color` = `t`.`color`) and (`20100225_sets`.`c`.`pid` = `20100225_sets`.`p`.`pid`)) group by `20100225_sets`.`c`.`sid`,`20100225_sets`.`p`.`color`,`t`.`total` having (count(0) = `t`.`total`)
And the same query on the blue parts:
SELECT c.sid, p.color FROM ( SELECT color, COUNT(*) AS total FROM parts WHERE color IN ('blue') GROUP BY color ) t JOIN parts p ON p.color = t.color JOIN catalog c ON c.pid = p.pid GROUP BY sid, color, total HAVING COUNT(*) = total
sid | color | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | blue | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
96 | blue | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
101 | blue | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0004s (3.0906s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | system | 1 | 100.00 | Using temporary; Using filesort | ||||
1 | PRIMARY | p | ref | PRIMARY,ix_parts_color | ix_parts_color | 302 | const | 1319 | 100.00 | Using index |
1 | PRIMARY | c | ref | ix_catalog_pid | ix_catalog_pid | 4 | 20100225_sets.p.pid | 109 | 100.00 | Using index |
2 | DERIVED | parts | ref | ix_parts_color | ix_parts_color | 302 | 1319 | 100.00 | Using where; Using index |
select `20100225_sets`.`c`.`sid` AS `sid`,`20100225_sets`.`p`.`color` AS `color` from (select `20100225_sets`.`parts`.`color` AS `color`,count(0) AS `total` from `20100225_sets`.`parts` where (`20100225_sets`.`parts`.`color` = 'blue') group by `20100225_sets`.`parts`.`color`) `t` join `20100225_sets`.`parts` `p` join `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`p`.`color` = 'blue') and (`20100225_sets`.`c`.`pid` = `20100225_sets`.`p`.`pid`)) group by `20100225_sets`.`c`.`sid` having (count(0) = '1980')
We see that the queries complete in 1.14 seconds and 3.09 seconds, accordingly.
The aggregate
method is more efficient for the red and green parts, while the first miss
is more efficient for the blue parts.
Analysis
What is the reason of such a difference in performance?
The first miss
method generally needs to parse less records (only those before the first miss), but each record is searched for in a nested loop, starting from the index root. The aggregate method needs to parse all records to calculate the COUNT(*)
, but these records are fetched in a sequential index access.
The red and green parts have the large number of small sets, with at most 10 records in each. The probability of the miss is relatively small: a large number of records should be parsed before any record is missed in a set. The aggregates, on the other hand, can be calculated very fast, since there are relatively few records to aggregate.
With the blue parts, the situation is different. There are few large sets, and calculating the aggregates requires fetching, sorting and grouping of too many records. First misses, on the other hand, occurs almost instantly: the vast majority of the suppliers do not offer any blue parts at all.
Summary
As it often happens, the performance of the two methods to compare the sets depends on the data distribution.
The sets with less records and lower probability of the record miss will benefit from the aggregate method, since the performance increase caused by the sequential access to the records overweights the need to parse a larger number of the records.
The sets with more records and higher probability of a miss will return the misses very soon, so the first miss method is more beneficial for them.
Written by Quassnoi
February 25th, 2010 at 11:00 pm
Posted in MySQL
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009