EXPLAIN EXTENDED

How to create fast database queries

MySQL: counting items in pairs

Comments enabled. I *really* need your comment

From Stack Overflow:

I currently have an item table, and a pair table.

The pair table simply contains two columns, which contain the PRIMARY KEY from the item table. There is an index on id1, id2 on pair.

A common query is to find a number of items that are featured in the least number of pairs:

SELECT  id, COUNT(*) AS count
FROM    item i
LEFT JOIN
pair p
ON      (p.id1 = i.id OR p.id2 = i.id)
GROUP BY
id
ORDER BY
count, RAND()
LIMIT 100

, but the query is horible performance wise.

Is there a better query, and/or data structure for this type of thing?

MySQL is not very good in optimizing OR conditions.

It's capable of doing index_merge, which would UNION the results of two indexes, however, as documentation states, this access method works only when comparing the fields against the constants.

This method, therefore, can be used to serve an OR condition neither in a JOIN nor in a subquery.

Let's create sample tables and see the execution plan for the query:

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE item (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(20) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE pair (
pid INT NOT NULL PRIMARY KEY,
id1 INT NOT NULL,
id2 INT NOT NULL,
KEY ix_pair_12 (id1, id2),
KEY ix_pair_2 (id2)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    item
SELECT  id, CONCAT('Item ', id)
FROM    filler
ORDER BY
        id
LIMIT 100;

INSERT
INTO    pair
SELECT  id, FLOOR(RAND(20090731) * 100) + 1, FLOOR(RAND(20090731 << 1) * 100) + 1
FROM    filler;

There are 100 items and 100,000 random pairs.

Here's the query:

SELECT  id, COUNT(*) AS count
FROM    item i
LEFT JOIN
pair p
ON      (p.id1 = i.id OR p.id2 = i.id)
GROUP BY
id
ORDER BY
count, RAND(20090731)
LIMIT 5

id count
86 1869
70 1887
92 1895
66 1909
16 1913
5 rows fetched in 0.0002s (7.2968s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE i index PRIMARY 4 100 100.00 Using index; Using temporary; Using filesort
1 SIMPLE p index ix_pair_12,ix_pair_2 ix_pair_12 8 100423 100.00 Using index

The index is used just for browsing, not filtering. The query takes more that 7 seconds.

We need to split the query into two separate queries that would use the corresponding indexes.

In SQL Server or PostgreSQL the following query would be possible:

SELECT  id,
(
SELECT  COUNT(*)
FROM    (
SELECT  pid
FROM    pair
WHERE   id1 = id
UNION
SELECT  pid
FROM    pair
WHERE   id2 = id
) q
)
count
FROM    item i

Unfortunately, MySQL (and Oracle too) do not support correlated query nesting more than one level deep.

The query above will give the following error:

Unknown column 'id' in 'where clause'

However, we can rewrite the query a little.

We can use a well-known formula: |A ∪ B| = |A| + |B| - |A ∩ B|

In other words, the number of elements in a union of two sets is the sum of numbers of elements in each set minus the number of elements in their intersection.

Each of the sets in the expression above can be efficiently queried with a subquery:

SELECT  id,
(
SELECT  COUNT(*)
FROM    pair
WHERE   id2 = id
) +
(
SELECT  COUNT(*)
FROM    pair
WHERE   id1 = id
) -
(
SELECT  COUNT(*)
FROM    pair
WHERE   id1 = id
AND id2 = id
)
count
FROM    item i
ORDER BY
count, RAND(20090731)
LIMIT 5

id count
86 1869
70 1887
92 1895
66 1909
16 1913
5 rows fetched in 0.0002s (0.1419s)

This is much faster, only 140 ms.

And we can improve the query yet a little more. Since id1 = id AND id2 = id is a subset of id1 = id, we can merge these two subqueries into one:

SELECT  id,
(
SELECT  COUNT(*)
FROM    pair
WHERE   id2 = id
) +
(
SELECT  COUNT(*)
FROM    pair
WHERE   id1 = id
AND id2 <> id
)
count
FROM    item i
ORDER BY
count, RAND(20090731)
LIMIT 5

id count
86 1869
70 1887
92 1895
66 1909
16 1913
5 rows fetched in 0.0002s (0.1382s)

, which is yet a trifle more efficient, only 138 ms.

Written by Quassnoi

July 31st, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply