EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL 8.4: efficient MIN(COUNT(*))

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Frances asks:

I have a table user_views which contains pages viewed by users.

How do I select the user that had the most and the fewest number of page views?

If you need them at the same time, you'll just have to group by myuser, then order by COUNT(*).

With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:

WITH    q AS
(
SELECT  myuser, COUNT(*)
FROM    user_pages
GROUP BY
myuser
)
SELECT  *
FROM    (
SELECT  'max'::TEXT AS which, myuser, count
FROM    q
ORDER BY
count DESC
LIMIT 1
) qmax
UNION ALL
SELECT  *
FROM    (
SELECT  'min'::TEXT AS which, myuser, count
FROM    q
ORDER BY
count
LIMIT 1
) qmin

However, it's takes quite a long time if you table is big.

This is probably an answer to the question you asked.

But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.

The main idea here that we should accumulate the least value of COUNT(*) calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.

Let's create a sample table:

CREATE TABLE t_counter (id INT NOT NULL PRIMARY KEY, grouper INT NOT NULL, value VARCHAR(300) NOT NULL);

CREATE INDEX ix_counter_grouper ON t_counter (grouper);

INSERT
INTO    t_counter
SELECT  s, 100 - (s % (s / 10000 + 1)), RPAD('Value ' || s, 400, ' ')
FROM    generate_series(1, 1000000) s;

ANALYZE t_counter;

Let's run our initial query:

WITH    q AS
(
SELECT  grouper, COUNT(*)
FROM    t_counter
GROUP BY
grouper
)
SELECT  *
FROM    (
SELECT  'max'::TEXT AS which, grouper, count
FROM    q
ORDER BY
count DESC
LIMIT 1
) qmax
UNION ALL
SELECT  *
FROM    (
SELECT  'min'::TEXT AS which, grouper, count
FROM    q
ORDER BY
count
LIMIT 1
) qmin

which grouper count
max 100 51833
min 0 1
2 rows fetched in 0.0002s (2.4062s)
Append  (cost=56670.71..56673.25 rows=2 width=44)
  CTE q
    ->  HashAggregate  (cost=56666.96..56668.21 rows=100 width=4)
          ->  Seq Scan on t_counter  (cost=0.00..51666.97 rows=999997 width=4)
  ->  Limit  (cost=2.50..2.50 rows=1 width=12)
        ->  Sort  (cost=2.50..2.75 rows=100 width=12)
              Sort Key: q.count
              ->  CTE Scan on q  (cost=0.00..2.00 rows=100 width=12)
  ->  Limit  (cost=2.50..2.50 rows=1 width=12)
        ->  Sort  (cost=2.50..2.75 rows=100 width=12)
              Sort Key: q.count
              ->  CTE Scan on q  (cost=0.00..2.00 rows=100 width=12)

2.4 seconds, quite long.

Now, we see that there is but one record with grouper = 0. That is, if we have seen this record we know that the minimal COUNT(*) will not ever exceed 1.

We need somehow to tell PostgreSQL to stop counting as soon as COUNT(*) for each next record exceeds this value.

To do this, we will use recursive CTEs.

Here's how we will build the query:

  1. Select the least grouper COUNT(*) for it and a literal TRUE. This boolean flag means that the current records holds the minimal COUNT(*) calculated so far, which is of course true for the first records.
  2. On each recursion step, select the grouper next to the previously evaluated one in sequential order
  3. For the newly selected grouper, select COUNT(*) in a subquery, limiting the number of rows counted with LIMIT cnt + 1, where cnt is the COUNT(*) from the previous step. Thus, no new COUNT(*) can exceed the previous by more than one record. We need this extra record to be counted to tell if the current grouper also holds the minimal value of COUNT(*)
  4. If the current value of COUNT(*) exceeds the cumulative minimum, return FALSE in the holds current minimum flag
  5. Return the LEAST of current COUNT(*) and cumulative minumum

For each grouper, this query will return the cumulative minimum of COUNT(*)'s calculated so far along with the boolean flag, indicating that the current record's actual COUNT(*) is equal to this cumulative minimum (and hence, this record is eligible for returning as a result).

Then we just return the record with minimal value of COUNT(*) and the flag set.

Here's the query:

WITH RECURSIVE
dg AS
(
SELECT  mingrouper AS grouper,
(
SELECT  COUNT(*)
FROM    t_counter
WHERE   grouper = mingrouper
) AS cnt,
TRUE AS cmin
FROM    (
SELECT  MIN(grouper) AS mingrouper
FROM    t_counter
) q
UNION ALL
SELECT  grouper, LEAST(cnt, ncnt), (ncnt < cnt) AS cmin
        FROM    (
                SELECT  grouper,
                        (
                        SELECT  COUNT(*)
                        FROM    (
                                SELECT  grouper
                                FROM    t_counter ci
                                WHERE   ci.grouper = q.grouper
                                LIMIT   q.cnt + 1
                                ) c
                        ) AS ncnt, cnt
                FROM    (
                        SELECT  (
                                SELECT  cn.grouper
                                FROM    t_counter cn
                                WHERE   cn.grouper > dg.grouper
ORDER BY
cn.grouper
LIMIT 1
) grouper,
cnt
FROM    dg
) q
WHERE   grouper IS NOT NULL
) q2
)
SELECT  *
FROM    dg
WHERE   cmin
ORDER BY
cnt DESC

grouper cnt cmin
0 1 1
1 row fetched in 0.0002s (0.0175s)
Sort  (cost=450651.85..450651.98 rows=50 width=13)
  Sort Key: dg.cnt
  CTE dg
    ->  Recursive Union  (cost=1.54..450648.42 rows=101 width=12)
          ->  Subquery Scan q  (cost=1.54..21335.71 rows=1 width=4)
                ->  Result  (cost=1.54..1.55 rows=1 width=0)
                      InitPlan 2 (returns $2)
                        ->  Limit  (cost=0.00..1.54 rows=1 width=4)
                              ->  Index Scan using ix_counter_grouper on t_counter  (cost=0.00..1540080.78 rows=999997 width=4)
                                    Filter: (grouper IS NOT NULL)
                SubPlan 1
                  ->  Aggregate  (cost=21334.14..21334.15 rows=1 width=0)
                        ->  Index Scan using ix_counter_grouper on t_counter  (cost=0.00..21309.13 rows=10000 width=0)
                              Index Cond: (grouper = $1)
          ->  WorkTable Scan on dg  (cost=0.00..42931.07 rows=10 width=12)
                Filter: ((SubPlan 8) IS NOT NULL)
                SubPlan 3
                  ->  Limit  (cost=0.00..1.56 rows=1 width=4)
                        ->  Index Scan using ix_counter_grouper on t_counter cn  (cost=0.00..519394.62 rows=333332 width=4)
                              Index Cond: (grouper > $3)
                SubPlan 5
                  ->  Aggregate  (cost=2144.97..2144.98 rows=1 width=0)
                        ->  Limit  (cost=1.56..2132.47 rows=1000 width=4)
                              InitPlan 4 (returns $4)
                                ->  Limit  (cost=0.00..1.56 rows=1 width=4)
                                      ->  Index Scan using ix_counter_grouper on t_counter cn  (cost=0.00..519394.62 rows=333332 width=4)
                                            Index Cond: (grouper > $3)
                              ->  Index Scan using ix_counter_grouper on t_counter ci  (cost=0.00..21309.13 rows=10000 width=4)
                                    Index Cond: (grouper = $4)
                SubPlan 7
                  ->  Aggregate  (cost=2144.97..2144.98 rows=1 width=0)
                        ->  Limit  (cost=1.56..2132.47 rows=1000 width=4)
                              InitPlan 6 (returns $6)
                                ->  Limit  (cost=0.00..1.56 rows=1 width=4)
                                      ->  Index Scan using ix_counter_grouper on t_counter cn  (cost=0.00..519394.62 rows=333332 width=4)
                                            Index Cond: (grouper > $3)
                              ->  Index Scan using ix_counter_grouper on t_counter ci  (cost=0.00..21309.13 rows=10000 width=4)
                                    Index Cond: (grouper = $6)
                SubPlan 8
                  ->  Limit  (cost=0.00..1.56 rows=1 width=4)
                        ->  Index Scan using ix_counter_grouper on t_counter cn  (cost=0.00..519394.62 rows=333332 width=4)
                              Index Cond: (grouper > $3)
  ->  CTE Scan on dg  (cost=0.00..2.02 rows=50 width=13)
        Filter: cmin

Only 10 ms, almost instant.

Let's try the same for groupers that have more rows, like starting from the grouper = 10:

WITH RECURSIVE
dg AS
(
SELECT  mingrouper AS grouper,
(
SELECT  COUNT(*)
FROM    t_counter
WHERE   grouper = mingrouper
) AS cnt,
TRUE AS cmin
FROM    (
SELECT  MIN(grouper) AS mingrouper
FROM    t_counter
WHERE   grouper >= 10
) q
UNION ALL
SELECT  grouper, LEAST(cnt, ncnt), (ncnt < cnt) AS cmin
        FROM    (
                SELECT  grouper,
                        (
                        SELECT  COUNT(*)
                        FROM    (
                                SELECT  grouper
                                FROM    t_counter ci
                                WHERE   ci.grouper = q.grouper
                                LIMIT   q.cnt + 1
                                ) c
                        ) AS ncnt, cnt
                FROM    (
                        SELECT  (
                                SELECT  cn.grouper
                                FROM    t_counter cn
                                WHERE   cn.grouper > dg.grouper
ORDER BY
cn.grouper
LIMIT 1
) grouper,
cnt
FROM    dg
) q
WHERE   grouper IS NOT NULL
) q2
)
SELECT  *
FROM    dg
WHERE   cmin
ORDER BY
cnt DESC

grouper cnt cmin
10 1052 1
1 row fetched in 0.0002s (0.4533s)

This of course takes a little bit longer, since the minimal COUNT(*) is 1052 and for each grouper the engine needs to count up to this value.

However, it's still a lot more efficient than the original query.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

July 29th, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply