Aggregate AND
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table with a foreign key and a boolean value (and a bunch of other columns that aren't relevant here), as such:
CREATE TABLE myTable ( someKey integer, someBool boolean ); INSERT INTO myTable VALUES (1, 't'), (1, 't'), (2, 'f'), (2, 't');Each
someKeycould have 0 or more entries.For any given
someKey, I need to know if
- All the entries are true, or
- Any of the entries are false
Basically, it's an
AND.
This solution is often used to represent polls that should be unanimous for the decision to be made (i. e. anyone can put a veto on the decision).
PostgreSQL offers a special aggregate BOOL_AND to do this.
However, an aggregate may be less efficient here.
The return value of an AND function is constrained by finding certain values:
- Whenever a
FALSEis found, the return value cannot beTRUEanymore. It's eitherFALSEorNULL. - Whenever a
NULLis found, the return value isNULL
Aggregates in PostgreSQL, however, won't take this into account.
What we need here is a method to stop and return whenever first NULL or FALSE value is found.
Let's create a sample table and see how can it may be done:
CREATE TABLE t_and (
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
vote BOOLEAN
);
SELECT setseed(20090514);
BEGIN TRANSACTION;
INSERT
INTO t_and
SELECT s, s / 10 + 1, (RANDOM() > 0.00001)::boolean
FROM generate_series(1, 1000000) s;
COMMIT;
ANALYZE t_and;
This table has 1,000,000 rows in 100,000 groups, with 10 votes in each group.
Votes have one chance of 100,000 to be a con. There turned out to be 8 votes con:
SELECT * FROM t_and WHERE NOT VOTE
| id | grouper | vote |
|---|---|---|
| 202143 | 20215 | 0 |
| 312766 | 31277 | 0 |
| 459159 | 45916 | 0 |
| 533946 | 53395 | 0 |
| 595653 | 59566 | 0 |
| 643821 | 64383 | 0 |
| 871114 | 87112 | 0 |
| 980017 | 98002 | 0 |
I intentionally made so few con's to demonstrate the difference.
Let's write a query that shows us how many decisions passed and how many did not:
SELECT decision, COUNT(*)
FROM (
SELECT grouper, BOOL_AND(vote) AS decision
FROM t_and
GROUP BY
grouper
) q
GROUP BY
decision
| decision | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 99993 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 0 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 rows fetched in 0.0002s (2.7719s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HashAggregate (cost=151436.80..151436.82 rows=2 width=1)
-> GroupAggregate (cost=142395.56..150595.76 rows=56069 width=5)
-> Sort (cost=142395.56..144895.34 rows=999912 width=5)
Sort Key: t_and.grouper
-> Seq Scan on t_and (cost=0.00..15405.12 rows=999912 width=5)
and runs for almost 3 seconds.
Now, we need this function to stop whenever it finds a false value.
To do this, we will employ PostgreSQL's DISTINCT ON (expression) clause.
This clause, being invoked after ORDER BY, selects only one (first) row per each DISTINCT value of the expression.
ORDER BY clause defines which row is considered first. For this clause to work, expression should be first in the ORDER BY list.
This is an analog of ROW_NUMBER() OVER (PARTITION BY expression) = 1 in Oracle and SQL Server.
Let's check it:
SELECT decision, COUNT(*)
FROM (
SELECT DISTINCT ON (grouper) grouper, vote AS decision
FROM t_and
ORDER BY
grouper, vote NULLS FIRST
) q
GROUP BY
decision
| decision | count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 99993 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 0 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 rows fetched in 0.0002s (1.9469s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HashAggregate (cost=148236.15..148236.18 rows=2 width=1)
-> Unique (cost=142395.56..147395.12 rows=56069 width=5)
-> Sort (cost=142395.56..144895.34 rows=999912 width=5)
Sort Key: t_and.grouper, t_and.vote
-> Seq Scan on t_and (cost=0.00..15405.12 rows=999912 width=5)
Now, it's less than 2 seconds: replacing GroupAggregate with Unique alone yields a 40% performance gain.
What's happening here?
For each given grouper, the first vote is selected.
Which vote is first is identified by ORDER BY clause: ORDER BY grouper, vote NULLS FIRST. That is NULL's come first, then FALSE's, then TRUE's.
It's easy to see that the first returned vote identifies the return value:
- If it's
NULL, it automatically makes the wholeANDto beNULL - If it's
FALSE, then there were noNULL's (or they would be returned first). The return value isFALSEalso - Finally, if the first
voteisTRUE, then all othervote's are alsoTRUE's, and so is the result
Now, let's try to select only those polls that were vetoed.
First, with the aggregates:
SELECT grouper
FROM t_and
GROUP BY
grouper
HAVING BOOL_AND(vote) IS FALSE
| grouper | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20215 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 31277 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 45916 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 53395 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 59566 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 64383 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 87112 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 98002 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 rows fetched in 0.0002s (2.7406s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GroupAggregate (cost=142395.56..150595.76 rows=56069 width=5)
Filter: (bool_and(vote) IS FALSE)
-> Sort (cost=142395.56..144895.34 rows=999912 width=5)
Sort Key: grouper
-> Seq Scan on t_and (cost=0.00..15405.12 rows=999912 width=5)
This performs just like the original query: it still need to count all the aggregates.
Now, using DISTINCT ON:
SELECT DISTINCT ON (grouper) grouper, vote AS decision
FROM t_and
WHERE NOT vote OR vote IS NULL
ORDER BY
grouper, vote NULLS FIRST
| grouper | decision | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20215 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 31277 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 45916 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 53395 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 59566 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 64383 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 87112 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 98002 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 rows fetched in 0.0003s (0.2485s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unique (cost=15405.13..15405.14 rows=1 width=5)
-> Sort (cost=15405.13..15405.14 rows=1 width=5)
Sort Key: grouper, vote
-> Seq Scan on t_and (cost=0.00..15405.12 rows=1 width=5)
Filter: ((NOT vote) OR (vote IS NULL))
And this query runs more than 10 times as fast as the aggregate.
This is because it filters out the TRUE's first and has much fewer record to be sorted and uniquified.
Subscribe in a reader