PostgreSQL: selecting items that belong to all categories
Comments enabled. I *really* need your comment
From Stack Overflow:
I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project Category Foo Apple Foo Banana Foo Carrot Bar Apple Bar Carrot Qux Apple Qux Banana What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.
For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.
If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.
A usual approach to this problem involves GROUP BY and COUNT. However, this can be made more efficient.
Let's create a sample table:
CREATE TABLE t_category (
item INT NOT NULL,
category INT NOT NULL
);
ALTER TABLE t_category
ADD CONSTRAINT pk_category_item_category PRIMARY KEY (category, item);
SELECT SETSEED(0.20090723);
INSERT
INTO t_category
SELECT item, category
FROM (
SELECT item, category, RANDOM() AS rnd
FROM generate_series(1, 100000) item
CROSS JOIN
generate_series(1, 20) category
) q
WHERE rnd < 0.1 OR category <= 10;
ANALYZE t_category;
[/sourcecode]
</div>
We have <strong>100,000</strong> items with categories from <strong>1</strong> to <strong>20</strong> randomly assigned to each item.
Categories from <strong>1</strong> to <strong>10</strong> are assigned always (i. e. all items belong to them), while categories from <strong>11</strong> to <strong>20</strong> are assigned randomly, each having <strong>10%</strong> probability of being assigned.
Now, let's select all items that belong to all categories from <strong>1</strong> to <strong>14</strong>.
A typical query looks like this:
SELECT item
FROM t_category
WHERE category IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
GROUP BY
item
HAVING COUNT(*) = 14
See results and execution plan
| item | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10310 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 90050 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22510 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28357 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1096 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17567 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 60911 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 rows fetched in 0.0003s (0.9725s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HashAggregate (cost=38571.15..39506.36 rows=62347 width=4)
Filter: (count(*) = 14)
-> Seq Scan on t_category (cost=0.00..35116.96 rows=690839 width=4)
Filter: (category = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::integer[]))
This works for almost 1 second.
Counting all categories takes too long in this solution.
But could we just stop counting when it's obvious that an item does not satisfy the condition (say, one of the categories was not found)?
To do this, we will use a nested NOT EXISTS construct.
The query will work as following:
First, we select a list of DISTINCT items for the least used category.
For the design like this it's very useful to collect statistics once a day and keep these statistics in categories table (how many items a category contains).
However, for our database we just know that categories from 1 to 10 are assigned to 100,000 items each, and categories from 11 to 20 are assigned to but 10,000 items in average.
That's why we can just pick a random category from 11 to 20, say, 14.
Then we check that all categories in the list (but the first one) do exist for each of the items contained in category 14.
To do this, we generate a resultset of conditions inside a NOT EXIST construct, and check each of the conditions against the table also with a NOT EXISTS condition. If but a single condition will not be found for a given item, both constructs will return FALSE immediately and the record will be skipped instantly.
We don't need to check for category 14 in the NOT EXISTS condition, since the resultset we are checking is guaranteed to contain only the items within category 14.
To build the resultset for the outermost NOT EXISTS, we will use generate_series and use its value as a subscript to an array of categories.
Here's the query:
SELECT *
FROM (
SELECT DISTINCT item
FROM t_category
WHERE category = 14
) mo
WHERE NOT EXISTS
(
SELECT NULL
FROM generate_series(1, 13) s
WHERE NOT EXISTS
(
SELECT NULL
FROM t_category mi
WHERE mi.item = mo.item
AND mi.category = ('{13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}'::INTEGER[])[s]
)
)
See results and execution plan
| item | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90050 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1096 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10310 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17567 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28357 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22510 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 60911 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 rows fetched in 0.0002s (0.0978s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nested Loop Anti Join (cost=5326.22..4035945.55 rows=472 width=4)
Join Filter: (NOT (alternatives: SubPlan 1 or hashed SubPlan 2))
-> HashAggregate (cost=5326.22..5335.66 rows=944 width=4)
-> Bitmap Heap Scan on t_category (cost=301.47..5300.09 rows=10450 width=4)
Recheck Cond: (category = 14)
-> Bitmap Index Scan on pk_category_item_category (cost=0.00..298.85 rows=10450 width=0)
Index Cond: (category = 14)
-> Function Scan on generate_series s (cost=0.00..12.50 rows=1000 width=4)
SubPlan 1
-> Index Scan using pk_category_item_category on t_category mi (cost=0.00..8.50 rows=1 width=0)
Index Cond: ((category = ('{13,12,11,10,9,8,7,6,5,4,3,2,1}'::integer[])[$1]) AND (item = $0))
SubPlan 2
-> Seq Scan on t_category mi (cost=0.00..15867.62 rows=1099962 width=8)
This completes in 90 ms: more then 10 times faster than GROUP BY solution.
Note that we put the categories into array in certain order: from the least probable ones to the most probable ones.
Thus, the categories that are more likely to fail are checked first, so that the record could be skipped earlier if it is not contained in all the categories required.
Written by Quassnoi
August 25th, 2009 at 11:00 pm
Posted in PostgreSQL
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
Subscribe in a reader