MySQL: counting items in pairs
Comments enabled. I *really* need your comment
From Stack Overflow:
I currently have an
item
table, and apair
table.The
pair
table simply contains two columns, which contain thePRIMARY KEY
from theitem
table. There is an index onid1, id2
onpair
.A common query is to find a number of items that are featured in the least number of pairs:
SELECT id, COUNT(*) AS count FROM item i LEFT JOIN pair p ON (p.id1 = i.id OR p.id2 = i.id) GROUP BY id ORDER BY count, RAND() LIMIT 100, but the query is horible performance wise.
Is there a better query, and/or data structure for this type of thing?
MySQL is not very good in optimizing OR
conditions.
It's capable of doing index_merge
, which would UNION
the results of two indexes, however, as documentation states, this access method works only when comparing the fields against the constants.
This method, therefore, can be used to serve an OR
condition neither in a JOIN
nor in a subquery.
Let's create sample tables and see the execution plan for the query:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE item ( id INT NOT NULL PRIMARY KEY, value VARCHAR(20) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; CREATE TABLE pair ( pid INT NOT NULL PRIMARY KEY, id1 INT NOT NULL, id2 INT NOT NULL, KEY ix_pair_12 (id1, id2), KEY ix_pair_2 (id2) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(100000); COMMIT; INSERT INTO item SELECT id, CONCAT('Item ', id) FROM filler ORDER BY id LIMIT 100; INSERT INTO pair SELECT id, FLOOR(RAND(20090731) * 100) + 1, FLOOR(RAND(20090731 << 1) * 100) + 1 FROM filler;
There are 100 items and 100,000 random pairs.
Here's the query:
SELECT id, COUNT(*) AS count FROM item i LEFT JOIN pair p ON (p.id1 = i.id OR p.id2 = i.id) GROUP BY id ORDER BY count, RAND(20090731) LIMIT 5
id | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
86 | 1869 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
70 | 1887 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
92 | 1895 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
66 | 1909 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16 | 1913 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0002s (7.2968s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | i | index | PRIMARY | 4 | 100 | 100.00 | Using index; Using temporary; Using filesort | ||
1 | SIMPLE | p | index | ix_pair_12,ix_pair_2 | ix_pair_12 | 8 | 100423 | 100.00 | Using index |
The index is used just for browsing, not filtering. The query takes more that 7 seconds.
We need to split the query into two separate queries that would use the corresponding indexes.
In SQL Server or PostgreSQL the following query would be possible:
SELECT id, ( SELECT COUNT(*) FROM ( SELECT pid FROM pair WHERE id1 = id UNION SELECT pid FROM pair WHERE id2 = id ) q ) count FROM item i
Unfortunately, MySQL (and Oracle too) do not support correlated query nesting more than one level deep.
The query above will give the following error:
Unknown column 'id' in 'where clause'
However, we can rewrite the query a little.
We can use a well-known formula: |A ∪ B| = |A| + |B| - |A ∩ B|
In other words, the number of elements in a union of two sets is the sum of numbers of elements in each set minus the number of elements in their intersection.
Each of the sets in the expression above can be efficiently queried with a subquery:
SELECT id, ( SELECT COUNT(*) FROM pair WHERE id2 = id ) + ( SELECT COUNT(*) FROM pair WHERE id1 = id ) - ( SELECT COUNT(*) FROM pair WHERE id1 = id AND id2 = id ) count FROM item i ORDER BY count, RAND(20090731) LIMIT 5
id | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
86 | 1869 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
70 | 1887 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
92 | 1895 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
66 | 1909 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16 | 1913 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0002s (0.1419s) |
This is much faster, only 140 ms.
And we can improve the query yet a little more. Since id1 = id AND id2 = id
is a subset of id1 = id
, we can merge these two subqueries into one:
SELECT id, ( SELECT COUNT(*) FROM pair WHERE id2 = id ) + ( SELECT COUNT(*) FROM pair WHERE id1 = id AND id2 <> id ) count FROM item i ORDER BY count, RAND(20090731) LIMIT 5
id | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
86 | 1869 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
70 | 1887 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
92 | 1895 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
66 | 1909 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16 | 1913 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0002s (0.1382s) |
, which is yet a trifle more efficient, only 138 ms.