EXPLAIN EXTENDED

How to create fast database queries

Aggregate AND

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table with a foreign key and a boolean value (and a bunch of other columns that aren't relevant here), as such:

CREATE TABLE myTable
(        someKey integer,
        someBool boolean
);
INSERT
INTO    myTable
VALUES
        (1, 't'),
        (1, 't'),
        (2, 'f'),
        (2, 't');

Each someKey could have 0 or more entries.

For any given someKey, I need to know if

  1. All the entries are true, or
  2. Any of the entries are false

Basically, it's an AND.

This solution is often used to represent polls that should be unanimous for the decision to be made (i. e. anyone can put a veto on the decision).

PostgreSQL offers a special aggregate BOOL_AND to do this.

However, an aggregate may be less efficient here.

The return value of an AND function is constrained by finding certain values:

  1. Whenever a FALSE is found, the return value cannot be TRUE anymore. It's either FALSE or NULL.
  2. Whenever a NULL is found, the return value is NULL

Aggregates in PostgreSQL, however, won't take this into account.

What we need here is a method to stop and return whenever first NULL or FALSE value is found.

Let's create a sample table and see how can it may be done:

CREATE TABLE t_and (
        id INT NOT NULL PRIMARY KEY,
        grouper INT NOT NULL,
        vote BOOLEAN
);

SELECT setseed(20090514);

BEGIN TRANSACTION;

INSERT
INTO    t_and
SELECT  s, s / 10 + 1, (RANDOM() > 0.00001)::boolean
FROM    generate_series(1, 1000000) s;

COMMIT;

ANALYZE t_and;

This table has 1,000,000 rows in 100,000 groups, with 10 votes in each group.

Votes have one chance of 100,000 to be a con. There turned out to be 8 votes con:

SELECT  *
FROM    t_and
WHERE   NOT VOTE
id grouper vote
202143 20215 0
312766 31277 0
459159 45916 0
533946 53395 0
595653 59566 0
643821 64383 0
871114 87112 0
980017 98002 0

I intentionally made so few con's to demonstrate the difference.

Let's write a query that shows us how many decisions passed and how many did not:

SELECT  decision, COUNT(*)
FROM    (
        SELECT  grouper, BOOL_AND(vote) AS decision
        FROM    t_and
        GROUP BY
                grouper
        ) q
GROUP BY
        decision
decision count
1 99993
0 8
2 rows fetched in 0.0002s (2.7719s)
HashAggregate  (cost=151436.80..151436.82 rows=2 width=1)
  ->  GroupAggregate  (cost=142395.56..150595.76 rows=56069 width=5)
        ->  Sort  (cost=142395.56..144895.34 rows=999912 width=5)
              Sort Key: t_and.grouper
              ->  Seq Scan on t_and  (cost=0.00..15405.12 rows=999912 width=5)

and runs for almost 3 seconds.

Now, we need this function to stop whenever it finds a false value.

To do this, we will employ PostgreSQL's DISTINCT ON (expression) clause.

This clause, being invoked after ORDER BY, selects only one (first) row per each DISTINCT value of the expression.

ORDER BY clause defines which row is considered first. For this clause to work, expression should be first in the ORDER BY list.

This is an analog of ROW_NUMBER() OVER (PARTITION BY expression) = 1 in Oracle and SQL Server.

Let's check it:

SELECT  decision, COUNT(*)
FROM    (
        SELECT  DISTINCT ON (grouper) grouper, vote AS decision
        FROM    t_and
        ORDER BY
                grouper, vote NULLS FIRST
        ) q
GROUP BY
        decision
decision count
1 99993
0 8
2 rows fetched in 0.0002s (1.9469s)
HashAggregate  (cost=148236.15..148236.18 rows=2 width=1)
  ->  Unique  (cost=142395.56..147395.12 rows=56069 width=5)
        ->  Sort  (cost=142395.56..144895.34 rows=999912 width=5)
              Sort Key: t_and.grouper, t_and.vote
              ->  Seq Scan on t_and  (cost=0.00..15405.12 rows=999912 width=5)

Now, it's less than 2 seconds: replacing GroupAggregate with Unique alone yields a 40% performance gain.

What's happening here?

For each given grouper, the first vote is selected.

Which vote is first is identified by ORDER BY clause: ORDER BY grouper, vote NULLS FIRST. That is NULL's come first, then FALSE's, then TRUE's.

It's easy to see that the first returned vote identifies the return value:

  • If it's NULL, it automatically makes the whole AND to be NULL
  • If it's FALSE, then there were no NULL's (or they would be returned first). The return value is FALSE also
  • Finally, if the first vote is TRUE, then all other vote's are also TRUE's, and so is the result

Now, let's try to select only those polls that were vetoed.

First, with the aggregates:

SELECT  grouper
FROM    t_and
GROUP BY
        grouper
HAVING  BOOL_AND(vote) IS FALSE

grouper
20215
31277
45916
53395
59566
64383
87112
98002
8 rows fetched in 0.0002s (2.7406s)
GroupAggregate  (cost=142395.56..150595.76 rows=56069 width=5)
  Filter: (bool_and(vote) IS FALSE)
  ->  Sort  (cost=142395.56..144895.34 rows=999912 width=5)
        Sort Key: grouper
        ->  Seq Scan on t_and  (cost=0.00..15405.12 rows=999912 width=5)

This performs just like the original query: it still need to count all the aggregates.

Now, using DISTINCT ON:

SELECT  DISTINCT ON (grouper) grouper, vote AS decision
FROM    t_and
WHERE   NOT vote OR vote IS NULL
ORDER BY
        grouper, vote NULLS FIRST

grouper decision
20215 0
31277 0
45916 0
53395 0
59566 0
64383 0
87112 0
98002 0
8 rows fetched in 0.0003s (0.2485s)
Unique  (cost=15405.13..15405.14 rows=1 width=5)
  ->  Sort  (cost=15405.13..15405.14 rows=1 width=5)
        Sort Key: grouper, vote
        ->  Seq Scan on t_and  (cost=0.00..15405.12 rows=1 width=5)
              Filter: ((NOT vote) OR (vote IS NULL))

And this query runs more than 10 times as fast as the aggregate.

This is because it filters out the TRUE's first and has much fewer record to be sorted and uniquified.

Written by Quassnoi

May 30th, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply