Matching sets: aggregates vs. first miss
Comments enabled. I *really* need your comment
From Stack Overflow:
Here is my schema:
Suppliers sid sname address
Parts pid pname color
Catalog sid pid cost I need to find the sids of suppliers who supply every red part or every green part.
This task requires matching the sets.
We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.
Unlike other engines, MySQL does not provide the set operators like EXCEPT or MINUS that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.
There are two ways to check that:
First miss
technique: test each record from the subset candidate against the superset candidate, returningFALSEif there is no match.Aggregate
technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match
Let’s test which way is faster in which cases. To do this, we will need some sample tables:
Table creation details
There are 10,000 suppliers and 2,000 parts.
The parts can be red, green or blue. As for red and green, there are 10 parts of each color, and they are distributed evenly across the suppliers. With blue, the situation is different: there are 1980 blue parts and only 200 first suppliers provide it. However, for each of the blue part suppliers, the probability of each blue part to be available is very high.
First miss
The first miss is a combination of NOT IN / NOT EXISTS clauses that immediately return FALSE whenever a single non-matching record is found. Since MySQL can only do nested loops, the performance of these queries is heavily dependent on proper indexing.
Let’s run this query to search for red or green parts:
SELECT *
FROM suppliers s
WHERE EXISTS
(
SELECT NULL
FROM (
SELECT 'red' AS color
UNION ALL
SELECT 'green' AS color
) ci
WHERE color NOT IN
(
SELECT color
FROM parts p
WHERE p.pid NOT IN
(
SELECT pid
FROM catalog c
WHERE c.sid = s.sid
)
)
)
This query runs for 2.17 s.
The same query for the blue parts:
SELECT *
FROM suppliers s
WHERE EXISTS
(
SELECT NULL
FROM (
SELECT 'blue' AS color
) ci
WHERE color NOT IN
(
SELECT color
FROM parts p
WHERE p.pid NOT IN
(
SELECT pid
FROM catalog c
WHERE c.sid = s.sid
)
)
)
This time the query is a little bit slower: 2.43 s.
Aggregate
In the aggregate solution, we first calculate the total number of parts for each color and the count the parts of this color supplied by each supplier. If the counts match, the sets match too.
Let’s try it on red and green parts first:
SELECT c.sid, p.color
FROM (
SELECT color, COUNT(*) AS total
FROM parts
WHERE color IN ('red', 'green')
GROUP BY
color
) t
JOIN parts p
ON p.color = t.color
JOIN catalog c
ON c.pid = p.pid
GROUP BY
sid, color, total
HAVING COUNT(*) = total
And the same query on the blue parts:
SELECT c.sid, p.color
FROM (
SELECT color, COUNT(*) AS total
FROM parts
WHERE color IN ('blue')
GROUP BY
color
) t
JOIN parts p
ON p.color = t.color
JOIN catalog c
ON c.pid = p.pid
GROUP BY
sid, color, total
HAVING COUNT(*) = total
We see that the queries complete in 1.14 seconds and 3.09 seconds, accordingly.
The aggregate
method is more efficient for the red and green parts, while the first miss
is more efficient for the blue parts.
Analysis
What is the reason of such a difference in performance?
The first miss
method generally needs to parse less records (only those before the first miss), but each record is searched for in a nested loop, starting from the index root. The aggregate method needs to parse all records to calculate the COUNT(*), but these records are fetched in a sequential index access.
The red and green parts have the large number of small sets, with at most 10 records in each. The probability of the miss is relatively small: a large number of records should be parsed before any record is missed in a set. The aggregates, on the other hand, can be calculated very fast, since there are relatively few records to aggregate.
With the blue parts, the situation is different. There are few large sets, and calculating the aggregates requires fetching, sorting and grouping of too many records. First misses, on the other hand, occurs almost instantly: the vast majority of the suppliers do not offer any blue parts at all.
Summary
As it often happens, the performance of the two methods to compare the sets depends on the data distribution.
The sets with less records and lower probability of the record miss will benefit from the aggregate method, since the performance increase caused by the sequential access to the records overweights the need to parse a larger number of the records.
The sets with more records and higher probability of a miss will return the misses very soon, so the first miss method is more beneficial for them.
Subscribe in a reader