PostgreSQL 8.4: efficient MIN(COUNT(*))
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Frances asks:
I have a table
user_viewswhich contains pages viewed by users.How do I select the user that had the most and the fewest number of page views?
If you need them at the same time, you'll just have to group by myuser, then order by COUNT(*).
With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:
WITH q AS ( SELECT myuser, COUNT(*) FROM user_pages GROUP BY myuser ) SELECT * FROM ( SELECT 'max'::TEXT AS which, myuser, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, myuser, count FROM q ORDER BY count LIMIT 1 ) qmin
However, it's takes quite a long time if you table is big.
This is probably an answer to the question you asked.
But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.
The main idea here that we should accumulate the least value of COUNT(*) calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.
Let's create a sample table:
CREATE TABLE t_counter (id INT NOT NULL PRIMARY KEY, grouper INT NOT NULL, value VARCHAR(300) NOT NULL);
CREATE INDEX ix_counter_grouper ON t_counter (grouper);
INSERT
INTO t_counter
SELECT s, 100 - (s % (s / 10000 + 1)), RPAD('Value ' || s, 400, ' ')
FROM generate_series(1, 1000000) s;
ANALYZE t_counter;
Let's run our initial query:
WITH q AS ( SELECT grouper, COUNT(*) FROM t_counter GROUP BY grouper ) SELECT * FROM ( SELECT 'max'::TEXT AS which, grouper, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, grouper, count FROM q ORDER BY count LIMIT 1 ) qmin
| which | grouper | count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| max | 100 | 51833 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| min | 0 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 rows fetched in 0.0002s (2.4062s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Append (cost=56670.71..56673.25 rows=2 width=44)
CTE q
-> HashAggregate (cost=56666.96..56668.21 rows=100 width=4)
-> Seq Scan on t_counter (cost=0.00..51666.97 rows=999997 width=4)
-> Limit (cost=2.50..2.50 rows=1 width=12)
-> Sort (cost=2.50..2.75 rows=100 width=12)
Sort Key: q.count
-> CTE Scan on q (cost=0.00..2.00 rows=100 width=12)
-> Limit (cost=2.50..2.50 rows=1 width=12)
-> Sort (cost=2.50..2.75 rows=100 width=12)
Sort Key: q.count
-> CTE Scan on q (cost=0.00..2.00 rows=100 width=12)
2.4 seconds, quite long.
Now, we see that there is but one record with grouper = 0. That is, if we have seen this record we know that the minimal COUNT(*) will not ever exceed 1.
We need somehow to tell PostgreSQL to stop counting as soon as COUNT(*) for each next record exceeds this value.
To do this, we will use recursive CTEs.
Here's how we will build the query:
- Select the least
grouperCOUNT(*)for it and a literalTRUE. This boolean flag means that the current records holds the minimalCOUNT(*)calculated so far, which is of course true for the first records. - On each recursion step, select the
groupernext to the previously evaluated one in sequential order - For the newly selected grouper, select
COUNT(*)in a subquery, limiting the number of rows counted withLIMIT cnt + 1, wherecntis theCOUNT(*)from the previous step. Thus, no newCOUNT(*)can exceed the previous by more than one record. We need this extra record to be counted to tell if the current grouper also holds the minimal value ofCOUNT(*) - If the current value of
COUNT(*)exceeds the cumulative minimum, returnFALSEin theholds current minimum
flag - Return the
LEASTof currentCOUNT(*)and cumulative minumum
For each grouper, this query will return the cumulative minimum of COUNT(*)'s calculated so far along with the boolean flag, indicating that the current record's actual COUNT(*) is equal to this cumulative minimum (and hence, this record is eligible for returning as a result).
Then we just return the record with minimal value of COUNT(*) and the flag set.
Here's the query:
WITH RECURSIVE
dg AS
(
SELECT mingrouper AS grouper,
(
SELECT COUNT(*)
FROM t_counter
WHERE grouper = mingrouper
) AS cnt,
TRUE AS cmin
FROM (
SELECT MIN(grouper) AS mingrouper
FROM t_counter
) q
UNION ALL
SELECT grouper, LEAST(cnt, ncnt), (ncnt < cnt) AS cmin
FROM (
SELECT grouper,
(
SELECT COUNT(*)
FROM (
SELECT grouper
FROM t_counter ci
WHERE ci.grouper = q.grouper
LIMIT q.cnt + 1
) c
) AS ncnt, cnt
FROM (
SELECT (
SELECT cn.grouper
FROM t_counter cn
WHERE cn.grouper > dg.grouper
ORDER BY
cn.grouper
LIMIT 1
) grouper,
cnt
FROM dg
) q
WHERE grouper IS NOT NULL
) q2
)
SELECT *
FROM dg
WHERE cmin
ORDER BY
cnt DESC
| grouper | cnt | cmin | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0002s (0.0175s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sort (cost=450651.85..450651.98 rows=50 width=13)
Sort Key: dg.cnt
CTE dg
-> Recursive Union (cost=1.54..450648.42 rows=101 width=12)
-> Subquery Scan q (cost=1.54..21335.71 rows=1 width=4)
-> Result (cost=1.54..1.55 rows=1 width=0)
InitPlan 2 (returns $2)
-> Limit (cost=0.00..1.54 rows=1 width=4)
-> Index Scan using ix_counter_grouper on t_counter (cost=0.00..1540080.78 rows=999997 width=4)
Filter: (grouper IS NOT NULL)
SubPlan 1
-> Aggregate (cost=21334.14..21334.15 rows=1 width=0)
-> Index Scan using ix_counter_grouper on t_counter (cost=0.00..21309.13 rows=10000 width=0)
Index Cond: (grouper = $1)
-> WorkTable Scan on dg (cost=0.00..42931.07 rows=10 width=12)
Filter: ((SubPlan 8) IS NOT NULL)
SubPlan 3
-> Limit (cost=0.00..1.56 rows=1 width=4)
-> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4)
Index Cond: (grouper > $3)
SubPlan 5
-> Aggregate (cost=2144.97..2144.98 rows=1 width=0)
-> Limit (cost=1.56..2132.47 rows=1000 width=4)
InitPlan 4 (returns $4)
-> Limit (cost=0.00..1.56 rows=1 width=4)
-> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4)
Index Cond: (grouper > $3)
-> Index Scan using ix_counter_grouper on t_counter ci (cost=0.00..21309.13 rows=10000 width=4)
Index Cond: (grouper = $4)
SubPlan 7
-> Aggregate (cost=2144.97..2144.98 rows=1 width=0)
-> Limit (cost=1.56..2132.47 rows=1000 width=4)
InitPlan 6 (returns $6)
-> Limit (cost=0.00..1.56 rows=1 width=4)
-> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4)
Index Cond: (grouper > $3)
-> Index Scan using ix_counter_grouper on t_counter ci (cost=0.00..21309.13 rows=10000 width=4)
Index Cond: (grouper = $6)
SubPlan 8
-> Limit (cost=0.00..1.56 rows=1 width=4)
-> Index Scan using ix_counter_grouper on t_counter cn (cost=0.00..519394.62 rows=333332 width=4)
Index Cond: (grouper > $3)
-> CTE Scan on dg (cost=0.00..2.02 rows=50 width=13)
Filter: cmin
Only 10 ms, almost instant.
Let's try the same for groupers that have more rows, like starting from the grouper = 10:
WITH RECURSIVE
dg AS
(
SELECT mingrouper AS grouper,
(
SELECT COUNT(*)
FROM t_counter
WHERE grouper = mingrouper
) AS cnt,
TRUE AS cmin
FROM (
SELECT MIN(grouper) AS mingrouper
FROM t_counter
WHERE grouper >= 10
) q
UNION ALL
SELECT grouper, LEAST(cnt, ncnt), (ncnt < cnt) AS cmin
FROM (
SELECT grouper,
(
SELECT COUNT(*)
FROM (
SELECT grouper
FROM t_counter ci
WHERE ci.grouper = q.grouper
LIMIT q.cnt + 1
) c
) AS ncnt, cnt
FROM (
SELECT (
SELECT cn.grouper
FROM t_counter cn
WHERE cn.grouper > dg.grouper
ORDER BY
cn.grouper
LIMIT 1
) grouper,
cnt
FROM dg
) q
WHERE grouper IS NOT NULL
) q2
)
SELECT *
FROM dg
WHERE cmin
ORDER BY
cnt DESC
| grouper | cnt | cmin | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1052 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0002s (0.4533s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This of course takes a little bit longer, since the minimal COUNT(*) is 1052 and for each grouper the engine needs to count up to this value.
However, it's still a lot more efficient than the original query.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Subscribe in a reader