PostgreSQL: selecting items that belong to all categories
Comments enabled. I *really* need your comment
From Stack Overflow:
I’ve got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project Category Foo Apple Foo Banana Foo Carrot Bar Apple Bar Carrot Qux Apple Qux Banana What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.
For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.
If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.
A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.
Let’s create a sample table:
We have 100,000 items with categories from 1 to 20 randomly assigned to each item.
Categories from 1 to 10 are assigned always (i. e. all items belong to them), while categories from 11 to 20 are assigned randomly, each having 10% probability of being assigned.
Now, let’s select all items that belong to all categories from 1 to 14.
A typical query looks like this:
SELECT item
FROM t_category
WHERE category IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
GROUP BY
item
HAVING COUNT(*) = 14
See results and execution plan
This works for almost 1 second.
Counting all categories takes too long in this solution.
But could we just stop counting when it’s obvious that an item does not satisfy the condition (say, one of the categories was not found)?
To do this, we will use a nested NOT EXISTS construct.
The query will work as following:
First, we select a list of DISTINCT items for the least used category.
For the design like this it’s very useful to collect statistics once a day and keep these statistics in categories table (how many items a category contains).
However, for our database we just know that categories from 1 to 10 are assigned to 100,000 items each, and categories from 11 to 20 are assigned to but 10,000 items in average.
That’s why we can just pick a random category from 11 to 20, say, 14.
Then we check that all categories in the list (but the first one) do exist for each of the items contained in category 14.
To do this, we generate a resultset of conditions inside a NOT EXIST construct, and check each of the conditions against the table also with a NOT EXISTS condition. If but a single condition will not be found for a given item, both constructs will return FALSE immediately and the record will be skipped instantly.
We don’t need to check for category 14 in the NOT EXISTS condition, since the resultset we are checking is guaranteed to contain only the items within category 14.
To build the resultset for the outermost NOT EXISTS, we will use generate_series and use its value as a subscript to an array of categories.
Here’s the query:
SELECT *
FROM (
SELECT DISTINCT item
FROM t_category
WHERE category = 14
) mo
WHERE NOT EXISTS
(
SELECT NULL
FROM generate_series(1, 13) s
WHERE NOT EXISTS
(
SELECT NULL
FROM t_category mi
WHERE mi.item = mo.item
AND mi.category = ('{13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}'::INTEGER[])[s]
)
)
See results and execution plan
This completes in 90 ms: more then 10 times faster than GROUP BY solution.
Note that we put the categories into array in certain order: from the least probable ones to the most probable ones.
Thus, the categories that are more likely to fail are checked first, so that the record could be skipped earlier if it is not contained in all the categories required.
Subscribe in a reader