Aggregate AND
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table with a foreign key and a boolean value (and a bunch of other columns that aren't relevant here), as such:
CREATE TABLE myTable ( someKey integer, someBool boolean ); INSERT INTO myTable VALUES (1, 't'), (1, 't'), (2, 'f'), (2, 't');Each
someKey
could have 0 or more entries.For any given
someKey
, I need to know if
- All the entries are true, or
- Any of the entries are false
Basically, it's an
AND
.
This solution is often used to represent polls that should be unanimous for the decision to be made (i. e. anyone can put a veto on the decision).
PostgreSQL offers a special aggregate BOOL_AND
to do this.
However, an aggregate may be less efficient here.
The return value of an AND
function is constrained by finding certain values:
- Whenever a
FALSE
is found, the return value cannot beTRUE
anymore. It's eitherFALSE
orNULL
. - Whenever a
NULL
is found, the return value isNULL
Aggregates in PostgreSQL, however, won't take this into account.
What we need here is a method to stop and return whenever first NULL
or FALSE
value is found.
Let's create a sample table and see how can it may be done:
CREATE TABLE t_and ( id INT NOT NULL PRIMARY KEY, grouper INT NOT NULL, vote BOOLEAN ); SELECT setseed(20090514); BEGIN TRANSACTION; INSERT INTO t_and SELECT s, s / 10 + 1, (RANDOM() > 0.00001)::boolean FROM generate_series(1, 1000000) s; COMMIT; ANALYZE t_and;
This table has 1,000,000 rows in 100,000 groups, with 10 votes in each group.
Votes have one chance of 100,000 to be a con. There turned out to be 8 votes con:
SELECT * FROM t_and WHERE NOT VOTE
id | grouper | vote |
---|---|---|
202143 | 20215 | 0 |
312766 | 31277 | 0 |
459159 | 45916 | 0 |
533946 | 53395 | 0 |
595653 | 59566 | 0 |
643821 | 64383 | 0 |
871114 | 87112 | 0 |
980017 | 98002 | 0 |
I intentionally made so few con's to demonstrate the difference.
Let's write a query that shows us how many decisions passed and how many did not:
SELECT decision, COUNT(*) FROM ( SELECT grouper, BOOL_AND(vote) AS decision FROM t_and GROUP BY grouper ) q GROUP BY decision
decision | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 99993 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 rows fetched in 0.0002s (2.7719s) |
HashAggregate (cost=151436.80..151436.82 rows=2 width=1) -> GroupAggregate (cost=142395.56..150595.76 rows=56069 width=5) -> Sort (cost=142395.56..144895.34 rows=999912 width=5) Sort Key: t_and.grouper -> Seq Scan on t_and (cost=0.00..15405.12 rows=999912 width=5)
and runs for almost 3 seconds.
Now, we need this function to stop whenever it finds a false value.
To do this, we will employ PostgreSQL's DISTINCT ON (expression)
clause.
This clause, being invoked after ORDER BY
, selects only one (first) row per each DISTINCT
value of the expression
.
ORDER BY
clause defines which row is considered first. For this clause to work, expression should be first in the ORDER BY
list.
This is an analog of ROW_NUMBER() OVER (PARTITION BY expression) = 1
in Oracle
and SQL Server
.
Let's check it:
SELECT decision, COUNT(*) FROM ( SELECT DISTINCT ON (grouper) grouper, vote AS decision FROM t_and ORDER BY grouper, vote NULLS FIRST ) q GROUP BY decision
decision | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 99993 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 rows fetched in 0.0002s (1.9469s) |
HashAggregate (cost=148236.15..148236.18 rows=2 width=1) -> Unique (cost=142395.56..147395.12 rows=56069 width=5) -> Sort (cost=142395.56..144895.34 rows=999912 width=5) Sort Key: t_and.grouper, t_and.vote -> Seq Scan on t_and (cost=0.00..15405.12 rows=999912 width=5)
Now, it's less than 2
seconds: replacing GroupAggregate
with Unique
alone yields a 40% performance gain.
What's happening here?
For each given grouper
, the first vote
is selected.
Which vote
is first is identified by ORDER BY
clause: ORDER BY grouper, vote NULLS FIRST
. That is NULL
's come first, then FALSE
's, then TRUE
's.
It's easy to see that the first returned vote
identifies the return value:
- If it's
NULL
, it automatically makes the wholeAND
to beNULL
- If it's
FALSE
, then there were noNULL
's (or they would be returned first). The return value isFALSE
also - Finally, if the first
vote
isTRUE
, then all othervote
's are alsoTRUE
's, and so is the result
Now, let's try to select only those polls that were vetoed.
First, with the aggregates:
SELECT grouper FROM t_and GROUP BY grouper HAVING BOOL_AND(vote) IS FALSE
grouper | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20215 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
31277 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
45916 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
53395 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
59566 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64383 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
87112 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
98002 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0002s (2.7406s) |
GroupAggregate (cost=142395.56..150595.76 rows=56069 width=5) Filter: (bool_and(vote) IS FALSE) -> Sort (cost=142395.56..144895.34 rows=999912 width=5) Sort Key: grouper -> Seq Scan on t_and (cost=0.00..15405.12 rows=999912 width=5)
This performs just like the original query: it still need to count all the aggregates.
Now, using DISTINCT ON
:
SELECT DISTINCT ON (grouper) grouper, vote AS decision FROM t_and WHERE NOT vote OR vote IS NULL ORDER BY grouper, vote NULLS FIRST
grouper | decision | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20215 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
31277 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
45916 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
53395 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
59566 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64383 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
87112 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
98002 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 rows fetched in 0.0003s (0.2485s) |
Unique (cost=15405.13..15405.14 rows=1 width=5) -> Sort (cost=15405.13..15405.14 rows=1 width=5) Sort Key: grouper, vote -> Seq Scan on t_and (cost=0.00..15405.12 rows=1 width=5) Filter: ((NOT vote) OR (vote IS NULL))
And this query runs more than 10 times as fast as the aggregate.
This is because it filters out the TRUE
's first and has much fewer record to be sorted and uniquified.