EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Matching sets: aggregates vs. first miss

Comments enabled. I *really* need your comment

From Stack Overflow:

Here is my schema:

Suppliers
sid sname address
Parts
pid pname color
Catalog
sid pid cost

I need to find the sids of suppliers who supply every red part or every green part.

This task requires matching the sets.

We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.

Unlike other engines, MySQL does not provide the set operators like EXCEPT or MINUS that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.

There are two ways to check that:

  • First miss technique: test each record from the subset candidate against the superset candidate, returning FALSE if there is no match.
  • Aggregate technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match

Let’s test which way is faster in which cases. To do this, we will need some sample tables:

Table creation details

There are 10,000 suppliers and 2,000 parts.

The parts can be red, green or blue. As for red and green, there are 10 parts of each color, and they are distributed evenly across the suppliers. With blue, the situation is different: there are 1980 blue parts and only 200 first suppliers provide it. However, for each of the blue part suppliers, the probability of each blue part to be available is very high.

First miss

The first miss is a combination of NOT IN / NOT EXISTS clauses that immediately return FALSE whenever a single non-matching record is found. Since MySQL can only do nested loops, the performance of these queries is heavily dependent on proper indexing.

Let’s run this query to search for red or green parts:

SELECT  *
FROM    suppliers s
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  'red' AS color
                UNION ALL
                SELECT  'green' AS color
                ) ci
        WHERE   color NOT IN
                (
                SELECT  color
                FROM    parts p
                WHERE   p.pid NOT IN
                        (
                        SELECT  pid
                        FROM    catalog c
                        WHERE   c.sid = s.sid
                        )
                )
        )

View query details

This query runs for 2.17 s.

The same query for the blue parts:

SELECT  *
FROM    suppliers s
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  'blue' AS color
                ) ci
        WHERE   color NOT IN
                (
                SELECT  color
                FROM    parts p
                WHERE   p.pid NOT IN
                        (
                        SELECT  pid
                        FROM    catalog c
                        WHERE   c.sid = s.sid
                        )
                )
        )

View query details

This time the query is a little bit slower: 2.43 s.

Aggregate

In the aggregate solution, we first calculate the total number of parts for each color and the count the parts of this color supplied by each supplier. If the counts match, the sets match too.

Let’s try it on red and green parts first:

SELECT  c.sid, p.color
FROM    (
        SELECT  color, COUNT(*) AS total
        FROM    parts
        WHERE   color IN ('red', 'green')
        GROUP BY
                color
        ) t
JOIN    parts p
ON      p.color = t.color
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        sid, color, total
HAVING  COUNT(*) = total

View query details

And the same query on the blue parts:

SELECT  c.sid, p.color
FROM    (
        SELECT  color, COUNT(*) AS total
        FROM    parts
        WHERE   color IN ('blue')
        GROUP BY
                color
        ) t
JOIN    parts p
ON      p.color = t.color
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        sid, color, total
HAVING  COUNT(*) = total

View query details

We see that the queries complete in 1.14 seconds and 3.09 seconds, accordingly.

The aggregate method is more efficient for the red and green parts, while the first miss is more efficient for the blue parts.

Analysis

What is the reason of such a difference in performance?

The first miss method generally needs to parse less records (only those before the first miss), but each record is searched for in a nested loop, starting from the index root. The aggregate method needs to parse all records to calculate the COUNT(*), but these records are fetched in a sequential index access.

The red and green parts have the large number of small sets, with at most 10 records in each. The probability of the miss is relatively small: a large number of records should be parsed before any record is missed in a set. The aggregates, on the other hand, can be calculated very fast, since there are relatively few records to aggregate.

With the blue parts, the situation is different. There are few large sets, and calculating the aggregates requires fetching, sorting and grouping of too many records. First misses, on the other hand, occurs almost instantly: the vast majority of the suppliers do not offer any blue parts at all.

Summary

As it often happens, the performance of the two methods to compare the sets depends on the data distribution.

The sets with less records and lower probability of the record miss will benefit from the aggregate method, since the performance increase caused by the sequential access to the records overweights the need to parse a larger number of the records.

The sets with more records and higher probability of a miss will return the misses very soon, so the first miss method is more beneficial for them.

Written by Quassnoi

February 25th, 2010 at 11:00 pm

Posted in MySQL

Leave a Reply