PostgreSQL 8.4: efficient MIN(COUNT(*))
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Frances asks:
I have a table
user_views
which contains pages viewed by users.How do I select the user that had the most and the fewest number of page views?
If you need them at the same time, you'll just have to group by myuser
, then order by COUNT(*)
.
With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:
WITH q AS ( SELECT myuser, COUNT(*) FROM user_pages GROUP BY myuser ) SELECT * FROM ( SELECT 'max'::TEXT AS which, myuser, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, myuser, count FROM q ORDER BY count LIMIT 1 ) qmin
However, it's takes quite a long time if you table is big.
This is probably an answer to the question you asked.
But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.
The main idea here that we should accumulate the least value of COUNT(*)
calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.
Let's create a sample table:
CREATE TABLE t_counter (id INT NOT NULL PRIMARY KEY, grouper INT NOT NULL, value VARCHAR(300) NOT NULL); CREATE INDEX ix_counter_grouper ON t_counter (grouper); INSERT INTO t_counter SELECT s, 100 - (s % (s / 10000 + 1)), RPAD('Value ' || s, 400, ' ') FROM generate_series(1, 1000000) s; ANALYZE t_counter;
Let's run our initial query:
WITH q AS ( SELECT grouper, COUNT(*) FROM t_counter GROUP BY grouper ) SELECT * FROM ( SELECT 'max'::TEXT AS which, grouper, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, grouper, count FROM q ORDER BY count LIMIT 1 ) qmin
which | grouper | count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
max | 100 | 51833 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
min | 0 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 rows fetched in 0.0002s (2.4062s) |
Append (cost=56670.71..56673.25 rows=2 width=44) CTE q -> HashAggregate (cost=56666.96..56668.21 rows=100 width=4) -> Seq Scan on t_counter (cost=0.00..51666.97 rows=999997 width=4) -> Limit (cost=2.50..2.50 rows=1 width=12) -> Sort (cost=2.50..2.75 rows=100 width=12) Sort Key: q.count -> CTE Scan on q (cost=0.00..2.00 rows=100 width=12) -> Limit (cost=2.50..2.50 rows=1 width=12) -> Sort (cost=2.50..2.75 rows=100 width=12) Sort Key: q.count -> CTE Scan on q (cost=0.00..2.00 rows=100 width=12)
2.4 seconds, quite long.
Now, we see that there is but one record with grouper = 0
. That is, if we have seen this record we know that the minimal COUNT(*)
will not ever exceed 1.
We need somehow to tell PostgreSQL to stop counting as soon as COUNT(*)
for each next record exceeds this value.
To do this, we will use recursive CTEs.
Here's how we will build the query:
- Select the least
grouper
COUNT(*)
for it and a literalTRUE
. This boolean flag means that the current records holds the minimalCOUNT(*)
calculated so far, which is of course true for the first records. - On each recursion step, select the
grouper
next to the previously evaluated one in sequential order - For the newly selected grouper, select
COUNT(*)
in a subquery, limiting the number of rows counted withLIMIT cnt + 1
, wherecnt
is theCOUNT(*)
from the previous step. Thus, no newCOUNT(*)
can exceed the previous by more than one record. We need this extra record to be counted to tell if the current grouper also holds the minimal value ofCOUNT(*)
- If the current value of
COUNT(*)
exceeds the cumulative minimum, returnFALSE
in theholds current minimum
flag - Return the
LEAST
of currentCOUNT(*)
and cumulative minumum
For each grouper
, this query will return the cumulative minimum of COUNT(*)
's calculated so far along with the boolean flag, indicating that the current record's actual COUNT(*)
is equal to this cumulative minimum (and hence, this record is eligible for returning as a result).
Then we just return the record with minimal value of COUNT(*)
and the flag set.
Here's the query:
WITH RECURSIVE dg AS ( SELECT mingrouper AS grouper, ( SELECT COUNT(*) FROM t_counter WHERE grouper = mingrouper ) AS cnt, TRUE AS cmin FROM ( SELECT MIN(grouper) AS mingrouper FROM t_counter ) q UNION ALL SELECT grouper, LEAST(cnt, ncnt), (ncnt < cnt) AS cmin FROM ( SELECT grouper, ( SELECT COUNT(*) FROM ( SELECT grouper FROM t_counter ci WHERE ci.grouper = q.grouper LIMIT q.cnt + 1 ) c ) AS ncnt, cnt FROM ( SELECT ( SELECT cn.grouper FROM t_counter cn WHERE cn.grouper > dg.grouper ORDER BY cn.grouper LIMIT 1 ) grouper, cnt FROM dg ) q WHERE grouper IS NOT NULL ) q2 ) SELECT * FROM dg WHERE cmin ORDER BY cnt DESC
grouper | cnt | cmin | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0175s) |
Sort (cost=450651.85..450651.98 rows=50 width=13) Sort Key: dg.cnt CTE dg -> Recursive Union (cost=1.54..450648.42 rows=101 width=12) -> Subquery Scan q (cost=1.54..21335.71 rows=1 width=4) -> Result (cost=1.54..1.55 rows=1 width=0) InitPlan 2 (returns $2) -> Limit (cost=0.00..1.54 rows=1 width=4) -> Index Scan using ix_counter_grouper on t_counter (cost=0.00..1540080.78 rows=999997 width=4) Filter: (grouper IS NOT NULL) SubPlan 1 -> Aggregate (cost=21334.14..21334.15 rows=1 width=0) -> Index Scan using ix_counter_grouper on t_counter (cost=0.00..21309.13 rows=10000 width=0) Index Cond: (grouper = $1) -> WorkTable Scan on dg (cost=0.00..42931.07 rows=10 width=12) Filter: ((SubPlan 8) IS NOT NULL) SubPlan 3 -> Limit (cost=0.00..1.56 rows=1 width=4) -> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4) Index Cond: (grouper > $3) SubPlan 5 -> Aggregate (cost=2144.97..2144.98 rows=1 width=0) -> Limit (cost=1.56..2132.47 rows=1000 width=4) InitPlan 4 (returns $4) -> Limit (cost=0.00..1.56 rows=1 width=4) -> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4) Index Cond: (grouper > $3) -> Index Scan using ix_counter_grouper on t_counter ci (cost=0.00..21309.13 rows=10000 width=4) Index Cond: (grouper = $4) SubPlan 7 -> Aggregate (cost=2144.97..2144.98 rows=1 width=0) -> Limit (cost=1.56..2132.47 rows=1000 width=4) InitPlan 6 (returns $6) -> Limit (cost=0.00..1.56 rows=1 width=4) -> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4) Index Cond: (grouper > $3) -> Index Scan using ix_counter_grouper on t_counter ci (cost=0.00..21309.13 rows=10000 width=4) Index Cond: (grouper = $6) SubPlan 8 -> Limit (cost=0.00..1.56 rows=1 width=4) -> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4) Index Cond: (grouper > $3) -> CTE Scan on dg (cost=0.00..2.02 rows=50 width=13) Filter: cmin
Only 10 ms, almost instant.
Let's try the same for groupers that have more rows, like starting from the grouper = 10
:
WITH RECURSIVE dg AS ( SELECT mingrouper AS grouper, ( SELECT COUNT(*) FROM t_counter WHERE grouper = mingrouper ) AS cnt, TRUE AS cmin FROM ( SELECT MIN(grouper) AS mingrouper FROM t_counter WHERE grouper >= 10 ) q UNION ALL SELECT grouper, LEAST(cnt, ncnt), (ncnt < cnt) AS cmin FROM ( SELECT grouper, ( SELECT COUNT(*) FROM ( SELECT grouper FROM t_counter ci WHERE ci.grouper = q.grouper LIMIT q.cnt + 1 ) c ) AS ncnt, cnt FROM ( SELECT ( SELECT cn.grouper FROM t_counter cn WHERE cn.grouper > dg.grouper ORDER BY cn.grouper LIMIT 1 ) grouper, cnt FROM dg ) q WHERE grouper IS NOT NULL ) q2 ) SELECT * FROM dg WHERE cmin ORDER BY cnt DESC
grouper | cnt | cmin | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 1052 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.4533s) |
This of course takes a little bit longer, since the minimal COUNT(*)
is 1052 and for each grouper
the engine needs to count up to this value.
However, it's still a lot more efficient than the original query.
Hope that helps.
I'm always glad to answer the questions regarding database queries.