PostgreSQL: selecting items that belong to all categories at EXPLAIN EXTENDED

I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:

Project Category

Foo Apple

Foo Banana

Foo Carrot

Bar Apple

Bar Carrot

Qux Apple

Qux Banana

What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.

For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.

If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.

Project	Category
Foo	Apple
Foo	Banana
Foo	Carrot
Bar	Apple
Bar	Carrot
Qux	Apple
Qux	Banana

A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.

Let's create a sample table:

Table creation details

CREATE TABLE t_category (
        item INT NOT NULL,
        category INT NOT NULL
        );

ALTER TABLE t_category
ADD CONSTRAINT pk_category_item_category PRIMARY KEY (category, item);

SELECT  SETSEED(0.20090723);

INSERT
INTO    t_category
SELECT  item, category
FROM    (
        SELECT  item, category, RANDOM() AS rnd
        FROM    generate_series(1, 100000) item
        CROSS JOIN
                generate_series(1, 20) category
        ) q
WHERE   rnd < 0.1 OR category <= 10;
 
ANALYZE t_category;
&#91;/sourcecode&#93;
</div>

We have <strong>100,000</strong> items with categories from <strong>1</strong> to <strong>20</strong> randomly assigned to each item.

Categories from <strong>1</strong> to <strong>10</strong> are assigned always (i. e. all items belong to them), while categories from <strong>11</strong> to <strong>20</strong> are assigned randomly, each having <strong>10%</strong> probability of being assigned.

Now, let's select all items that belong to all categories from <strong>1</strong> to <strong>14</strong>.

A typical query looks like this:


SELECT  item
FROM    t_category
WHERE   category IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
GROUP BY
        item
HAVING  COUNT(*) = 14

See results and execution plan

item
10
10310
90050
22510
28357
1096
17567
60911
8 rows fetched in 0.0003s (0.9725s)

HashAggregate  (cost=38571.15..39506.36 rows=62347 width=4)
  Filter: (count(*) = 14)
  ->  Seq Scan on t_category  (cost=0.00..35116.96 rows=690839 width=4)
        Filter: (category = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::integer[]))

This works for almost 1 second.

Counting all categories takes too long in this solution.

But could we just stop counting when it's obvious that an item does not satisfy the condition (say, one of the categories was not found)?

To do this, we will use a nested NOT EXISTS construct.

The query will work as following:

First, we select a list of DISTINCT items for the least used category.

For the design like this it's very useful to collect statistics once a day and keep these statistics in categories table (how many items a category contains).

However, for our database we just know that categories from 1 to 10 are assigned to 100,000 items each, and categories from 11 to 20 are assigned to but 10,000 items in average.

That's why we can just pick a random category from 11 to 20, say, 14.

Then we check that all categories in the list (but the first one) do exist for each of the items contained in category 14.

To do this, we generate a resultset of conditions inside a NOT EXIST construct, and check each of the conditions against the table also with a NOT EXISTS condition. If but a single condition will not be found for a given item, both constructs will return FALSE immediately and the record will be skipped instantly.

We don't need to check for category 14 in the NOT EXISTS condition, since the resultset we are checking is guaranteed to contain only the items within category 14.

To build the resultset for the outermost NOT EXISTS, we will use generate_series and use its value as a subscript to an array of categories.

Here's the query:

SELECT  *
FROM    (
        SELECT  DISTINCT item
        FROM    t_category
        WHERE   category = 14
        ) mo
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    generate_series(1, 13) s
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    t_category mi
                WHERE   mi.item = mo.item
                        AND mi.category = ('{13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}'::INTEGER[])[s]
                )
        )

See results and execution plan

item
90050
1096
10310
17567
10
28357
22510
60911
8 rows fetched in 0.0002s (0.0978s)

Nested Loop Anti Join  (cost=5326.22..4035945.55 rows=472 width=4)
  Join Filter: (NOT (alternatives: SubPlan 1 or hashed SubPlan 2))
  ->  HashAggregate  (cost=5326.22..5335.66 rows=944 width=4)
        ->  Bitmap Heap Scan on t_category  (cost=301.47..5300.09 rows=10450 width=4)
              Recheck Cond: (category = 14)
              ->  Bitmap Index Scan on pk_category_item_category  (cost=0.00..298.85 rows=10450 width=0)
                    Index Cond: (category = 14)
  ->  Function Scan on generate_series s  (cost=0.00..12.50 rows=1000 width=4)
  SubPlan 1
    ->  Index Scan using pk_category_item_category on t_category mi  (cost=0.00..8.50 rows=1 width=0)
          Index Cond: ((category = ('{13,12,11,10,9,8,7,6,5,4,3,2,1}'::integer[])[$1]) AND (item = $0))
  SubPlan 2
    ->  Seq Scan on t_category mi  (cost=0.00..15867.62 rows=1099962 width=8)

This completes in 90 ms: more then 10 times faster than GROUP BY solution.

Note that we put the categories into array in certain order: from the least probable ones to the most probable ones.

Thus, the categories that are more likely to fail are checked first, so that the record could be skipped earlier if it is not contained in all the categories required.

Written by Quassnoi

August 25th, 2009 at 11:00 pm

Posted in PostgreSQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

EXPLAIN EXTENDED