PostgreSQL: selecting items that belong to all categories
Comments enabled. I *really* need your comment
From Stack Overflow:
I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project Category Foo Apple Foo Banana Foo Carrot Bar Apple Bar Carrot Qux Apple Qux Banana What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories.
For example, if a user selects categories Apple and Banana, projects Foo and Qux show up.
If a user select categories Apple, Banana, and Carrot then only the Foo project shows up.
A usual approach to this problem involves GROUP BY
and COUNT
. However, this can be made more efficient.
Let's create a sample table:
CREATE TABLE t_category ( item INT NOT NULL, category INT NOT NULL ); ALTER TABLE t_category ADD CONSTRAINT pk_category_item_category PRIMARY KEY (category, item); SELECT SETSEED(0.20090723); INSERT INTO t_category SELECT item, category FROM ( SELECT item, category, RANDOM() AS rnd FROM generate_series(1, 100000) item CROSS JOIN generate_series(1, 20) category ) q WHERE rnd < 0.1 OR category <= 10; ANALYZE t_category; [/sourcecode] </div> We have <strong>100,000</strong> items with categories from <strong>1</strong> to <strong>20</strong> randomly assigned to each item. Categories from <strong>1</strong> to <strong>10</strong> are assigned always (i. e. all items belong to them), while categories from <strong>11</strong> to <strong>20</strong> are assigned randomly, each having <strong>10%</strong> probability of being assigned. Now, let's select all items that belong to all categories from <strong>1</strong> to <strong>14</strong>. A typical query looks like this: SELECT item FROM t_category WHERE category IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14) GROUP BY item HAVING COUNT(*) = 14
See results and execution plan
item | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10310 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
90050 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
22510 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
28357 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1096 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17567 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
60911 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0003s (0.9725s) |
HashAggregate (cost=38571.15..39506.36 rows=62347 width=4) Filter: (count(*) = 14) -> Seq Scan on t_category (cost=0.00..35116.96 rows=690839 width=4) Filter: (category = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::integer[]))
This works for almost 1 second.
Counting all categories takes too long in this solution.
But could we just stop counting when it's obvious that an item does not satisfy the condition (say, one of the categories was not found)?
To do this, we will use a nested NOT EXISTS
construct.
The query will work as following:
First, we select a list of DISTINCT
items
for the least used category.
For the design like this it's very useful to collect statistics once a day and keep these statistics in categories
table (how many items a category contains).
However, for our database we just know that categories from 1 to 10 are assigned to 100,000 items each, and categories from 11 to 20 are assigned to but 10,000 items in average.
That's why we can just pick a random category from 11 to 20, say, 14.
Then we check that all categories in the list (but the first one) do exist for each of the items contained in category 14.
To do this, we generate a resultset of conditions inside a NOT EXIST
construct, and check each of the conditions against the table also with a NOT EXISTS
condition. If but a single condition will not be found for a given item, both constructs will return FALSE
immediately and the record will be skipped instantly.
We don't need to check for category 14 in the NOT EXISTS
condition, since the resultset we are checking is guaranteed to contain only the items within category 14.
To build the resultset for the outermost NOT EXISTS
, we will use generate_series
and use its value as a subscript to an array of categories.
Here's the query:
SELECT * FROM ( SELECT DISTINCT item FROM t_category WHERE category = 14 ) mo WHERE NOT EXISTS ( SELECT NULL FROM generate_series(1, 13) s WHERE NOT EXISTS ( SELECT NULL FROM t_category mi WHERE mi.item = mo.item AND mi.category = ('{13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}'::INTEGER[])[s] ) )
See results and execution plan
item | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
90050 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1096 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10310 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17567 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
28357 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
22510 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
60911 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0002s (0.0978s) |
Nested Loop Anti Join (cost=5326.22..4035945.55 rows=472 width=4) Join Filter: (NOT (alternatives: SubPlan 1 or hashed SubPlan 2)) -> HashAggregate (cost=5326.22..5335.66 rows=944 width=4) -> Bitmap Heap Scan on t_category (cost=301.47..5300.09 rows=10450 width=4) Recheck Cond: (category = 14) -> Bitmap Index Scan on pk_category_item_category (cost=0.00..298.85 rows=10450 width=0) Index Cond: (category = 14) -> Function Scan on generate_series s (cost=0.00..12.50 rows=1000 width=4) SubPlan 1 -> Index Scan using pk_category_item_category on t_category mi (cost=0.00..8.50 rows=1 width=0) Index Cond: ((category = ('{13,12,11,10,9,8,7,6,5,4,3,2,1}'::integer[])[$1]) AND (item = $0)) SubPlan 2 -> Seq Scan on t_category mi (cost=0.00..15867.62 rows=1099962 width=8)
This completes in 90 ms: more then 10 times faster than GROUP BY
solution.
Note that we put the categories into array in certain order: from the least probable ones to the most probable ones.
Thus, the categories that are more likely to fail are checked first, so that the record could be skipped earlier if it is not contained in all the categories required.
Written by Quassnoi
August 25th, 2009 at 11:00 pm
Posted in PostgreSQL
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009