EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL: selecting a function

Comments enabled. I *really* need your comment

From Stack Overflow:

Hello,

I want to write a SELECT statement as follows:

SELECT  field_a
FROM    my_table
WHERE   field_b IN (my_function(field_c))

Is that possible?

Would my_function have to return an array?

It is of course possible in PostgreSQL using a set returning function.

Return type of such a function should be declared as SETOF, so the function will return a rowset of given type.

There are at least two ways to call this function, and in this article I will consider the benefits and drawbacks of each method.

Let's create sample tables:

CREATE TABLE t_address (
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
address CIDR NOT NULL
);

CREATE TABLE t_ban (
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
address CIDR NOT NULL
);

CREATE INDEX ix_address_grouper ON t_address (grouper);
CREATE INDEX ix_ban_grouper ON t_ban (grouper);

CREATE OR REPLACE FUNCTION fn_ban(INT) RETURNS SETOF t_ban AS $$
SELECT  *
FROM    t_ban
WHERE   grouper = $1
$$
LANGUAGE 'sql';

SELECT setseed(20090514);

BEGIN TRANSACTION;

INSERT
INTO    t_address (id, grouper, address)
SELECT  s, s % 10, '10.0.0.0/32'::cidr + (RANDOM() * x'10000'::integer)::integer
FROM    generate_series(1, 1000) s;

INSERT
INTO    t_ban (id, grouper, address)
SELECT  s, s % 10, '10.0.0.0/32'::cidr + (RANDOM() * x'10000'::integer)::integer
FROM    generate_series(1, 10000) s;

COMMIT;

ANALYZE t_address;
ANALYZE t_ban;

There are two tables: t_address and t_ban.

t_address contains a lits of 1,000 random IP addresses grouped into 10 groups.

t_ban contains a list of 10,000 random banned addresses, also grouped into 10 groups.

fn_ban is a set returning function. Given a group, it returns all banned addresses for this group.

Out goal is to write a query that check all addresses from t_address against the results of the function to see whether they are banned or not.

We can use a traditional IN expession with a subquery:

SELECT  *
FROM    t_address
WHERE   address IN
(
SELECT  (fn_ban(grouper)).address
)

id grouper address
238 8 10.0.252.94/32
239 9 10.0.75.83/32
277 7 10.0.159.64/32
330 0 10.0.125.79/32
451 1 10.0.132.173/32
459 9 10.0.105.176/32
491 1 10.0.136.40/32
645 5 10.0.71.243/32
702 2 10.0.212.83/32
728 8 10.0.46.171/32
767 7 10.0.218.97/32
778 8 10.0.65.71/32
808 8 10.0.8.88/32
844 4 10.0.201.183/32
873 3 10.0.2.48/32
923 3 10.0.87.233/32
16 rows fetched in 0.0006s (1.3593s)
Seq Scan on t_address  (cost=0.00..149.75 rows=500 width=15)
  Filter: (subplan)
  SubPlan
    ->  Result  (cost=0.00..0.26 rows=1 width=0)

PostgreSQL also offers the ability to use a set returning function in a SELECT clause along with other columns.

In this case, each row of the resultset will be repeated as many times as many rows are returned by the function.

In other words, calling a function inside a SELECT clause acts similar to SQL Server's APPLY operator.

We can call this function inside a SELECT clause and then use the outer subquery to filter the results:

SELECT  id, grouper, address
FROM    (
SELECT  a.*, fn_ban(grouper) AS ban
FROM    t_address a
) q
WHERE   q.address = (ban).address

id grouper address
238 8 10.0.252.94/32
239 9 10.0.75.83/32
277 7 10.0.159.64/32
330 0 10.0.125.79/32
451 1 10.0.132.173/32
459 9 10.0.105.176/32
491 1 10.0.136.40/32
645 5 10.0.71.243/32
702 2 10.0.212.83/32
728 8 10.0.46.171/32
767 7 10.0.218.97/32
778 8 10.0.65.71/32
808 8 10.0.8.88/32
844 4 10.0.201.183/32
873 3 10.0.2.48/32
923 3 10.0.87.233/32
16 rows fetched in 0.0006s (1.4374s)
Subquery Scan q  (cost=0.00..278.50 rows=5 width=72)
  Filter: ((q.address)::inet = ((q.ban).address)::inet)
  ->  Seq Scan on t_address a  (cost=0.00..266.00 rows=1000 width=15)

Now, which one is better?

If we look into the plan of the former query (using IN clause), we will see that PostgreSQL uses a Filter on the results of SubPlan.

This Filter returns TRUE as soon as it finds the first matching result, so it just stops scanning the function as soon as it happens.

And we actually can see that the query performance is slightly better in the first case.

For any given value, the function always returns the same rowset:

SELECT  *
FROM    fn_ban(0)

id grouper address
10 0 10.0.227.232/32
20 0 10.0.212.255/32
30 0 10.0.33.91/32
9980 0 10.0.34.80/32
9990 0 10.0.232.149/32
10000 0 10.0.158.92/32
1000 rows fetched in 0.0530s (0.0074s)
Function Scan on fn_ban  (cost=0.00..260.00 rows=1000 width=40)

Let's check what will happen if the subquery matches the first or the last value returned by the function.

This query matches the first value (10.0.227.232/32) of those returned by fn_ban for group 0:

SELECT  COUNT(*)
FROM    t_address
WHERE   grouper = 0
AND '10.0.227.232/32'::cidr IN
(
SELECT  (fn_ban(grouper)).address
)

count
100
1 row fetched in 0.0001s (0.0223s)
Aggregate  (cost=25.77..25.78 rows=1 width=0)
  ->  Bitmap Heap Scan on t_address  (cost=5.01..25.64 rows=50 width=0)
        Recheck Cond: (grouper = 0)
        Filter: (subplan)
        ->  Bitmap Index Scan on ix_address_grouper  (cost=0.00..5.00 rows=100 width=0)
              Index Cond: (grouper = 0)
        SubPlan
          ->  Result  (cost=0.00..0.26 rows=1 width=0)

In the second case, when the subquery matches the last record returned by the function, the execution time is much higher:

SELECT  COUNT(*)
FROM    t_address
WHERE   grouper = 0
AND '10.0.158.92/32'::cidr IN
(
SELECT  (fn_ban(grouper)).address
)

count
100
1 row fetched in 0.0001s (0.1430s)
Aggregate  (cost=25.77..25.78 rows=1 width=0)
  ->  Bitmap Heap Scan on t_address  (cost=5.01..25.64 rows=50 width=0)
        Recheck Cond: (grouper = 0)
        Filter: (subplan)
        ->  Bitmap Index Scan on ix_address_grouper  (cost=0.00..5.00 rows=100 width=0)
              Index Cond: (grouper = 0)
        SubPlan
          ->  Result  (cost=0.00..0.26 rows=1 width=0)

The latter query takes 140 ms to complete, while the former one takes only 20 ms (7 times as performant as the latter one).

We can easily see that PostgreSQL stops fetching the records from the function as soon as it find the first matching record.

Both queries, being rewritten using dervied tables, would run for the same time.

Derived tables, however, have two major advantages.

First, they allow seeing and selecting columns that cannot be evaluated with an IN expression syntax.

Say, we need the id of the banned address.

We cannot select it with a subquery (since it can return only one column and it's already used in the IN clause).

But with a dervied query this is easy:

SELECT  id, grouper, address, (ban).id AS banid
FROM    (
SELECT  a.*, fn_ban(grouper) AS ban
FROM    t_address a
) q
WHERE   q.address = (ban).address

id grouper address banid
238 8 10.0.252.94/32 1638
239 9 10.0.75.83/32 7529
277 7 10.0.159.64/32 9747
330 0 10.0.125.79/32 2260
451 1 10.0.132.173/32 4751
459 9 10.0.105.176/32 7279
491 1 10.0.136.40/32 2381
645 5 10.0.71.243/32 7105
702 2 10.0.212.83/32 5492
728 8 10.0.46.171/32 2248
767 7 10.0.218.97/32 1757
778 8 10.0.65.71/32 9028
808 8 10.0.8.88/32 2428
844 4 10.0.201.183/32 8124
873 3 10.0.2.48/32 233
923 3 10.0.87.233/32 9733
16 rows fetched in 0.0007s (1.4843s)
Subquery Scan q  (cost=0.00..278.50 rows=5 width=72)
  Filter: ((q.address)::inet = ((q.ban).address)::inet)
  ->  Seq Scan on t_address a  (cost=0.00..266.00 rows=1000 width=15)

Now we can see the id or the record in t_ban along with the address.

The second advantage of a derived table is ability to return duplicate records.

If an address has a duplicate entry in t_ban, the derived table will return both occurencies along with their id's, while in IN clause will return only the first.

Summary:

There are two methods for checking records of one table against the results of a set-returning function in PostgreSQL: using IN subquery and using SELECT clause function calling with subsequent filtering.

Using IN subquery is faster, since it stops executing the function as soon as it find the first matching value.

Using the function in SELECT clause is slower by design, since it can return duplicate values.

This method, however, allows selecting other columns along with those we are checking against.

Duplicate values may also be of use, in which case this method will be the only acceptable one.

Written by Quassnoi

May 14th, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply