EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL: selecting items that belong to all categories

Comments enabled. I *really* need your comment

From Stack Overflow:

I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:

Project Category
Foo Apple
Foo Banana
Foo Carrot
Bar Apple
Bar Carrot
Qux Apple
Qux Banana

What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.

For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.

If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.

A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.

Let's create a sample table:


Table creation details

We have 100,000 items with categories from 1 to 20 randomly assigned to each item.

Categories from 1 to 10 are assigned always (i. e. all items belong to them), while categories from 11 to 20 are assigned randomly, each having 10% probability of being assigned.

Now, let's select all items that belong to all categories from 1 to 14.

A typical query looks like this:

SELECT  item
FROM    t_category
WHERE   category IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
GROUP BY
        item
HAVING  COUNT(*) = 14

See results and execution plan

This works for almost 1 second.

Counting all categories takes too long in this solution.

But could we just stop counting when it's obvious that an item does not satisfy the condition (say, one of the categories was not found)?

To do this, we will use a nested NOT EXISTS construct.

The query will work as following:

First, we select a list of DISTINCT items for the least used category.

For the design like this it's very useful to collect statistics once a day and keep these statistics in categories table (how many items a category contains).

However, for our database we just know that categories from 1 to 10 are assigned to 100,000 items each, and categories from 11 to 20 are assigned to but 10,000 items in average.

That's why we can just pick a random category from 11 to 20, say, 14.

Then we check that all categories in the list (but the first one) do exist for each of the items contained in category 14.

To do this, we generate a resultset of conditions inside a NOT EXIST construct, and check each of the conditions against the table also with a NOT EXISTS condition. If but a single condition will not be found for a given item, both constructs will return FALSE immediately and the record will be skipped instantly.

We don't need to check for category 14 in the NOT EXISTS condition, since the resultset we are checking is guaranteed to contain only the items within category 14.

To build the resultset for the outermost NOT EXISTS, we will use generate_series and use its value as a subscript to an array of categories.

Here's the query:

SELECT  *
FROM    (
        SELECT  DISTINCT item
        FROM    t_category
        WHERE   category = 14
        ) mo
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    generate_series(1, 13) s
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    t_category mi
                WHERE   mi.item = mo.item
                        AND mi.category = ('{13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}'::INTEGER[])[s]
                )
        )

See results and execution plan

This completes in 90 ms: more then 10 times faster than GROUP BY solution.

Note that we put the categories into array in certain order: from the least probable ones to the most probable ones.

Thus, the categories that are more likely to fail are checked first, so that the record could be skipped earlier if it is not contained in all the categories required.

Written by Quassnoi

August 25th, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply