EXPLAIN EXTENDED

How to create fast database queries

Counting missing rows: PostgreSQL

Comments enabled. I *really* need your comment

This is the 5th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

Finally let's look how PostgreSQL copes with this predicate.

Let's create sample tables:

Table creation details

As in previous articles, table1 has 1,000,000 rows and table2 has 1,000 rows.

Now, let's check the main query:

SELECT  COUNT(*)
FROM    table1
WHERE   id NOT IN
(
SELECT  id
FROM    table2
)

count
999000
1 row fetched in 0.0001s (0.5469s)

This query runs for 546 ms and uses the following plan:

Aggregate  (cost=21017.25..21017.26 rows=1 width=0)
  ->  Seq Scan on table1  (cost=19.50..19767.91 rows=499736 width=0)
        Filter: (NOT (hashed subplan))
        SubPlan
          ->  Seq Scan on table2  (cost=0.00..17.00 rows=1000 width=4)

HASHED SUBPLAN is much alike HASH ANTI JOIN used by Oracle: it builds a hash table over the id's from table2 and searches the id's from table1 in this hash table.

Unlike Oracle, PostgreSQL chooses SEQ SCAN (which scans the table data pages) to retrieve the rowset, because PostgreSQL isn't capable of doing the INDEX FAST FULL SCAN.

If we search for missing grouper's, both query plan and execution time remain the same:

SELECT  COUNT(*)
FROM    table1
WHERE   id NOT IN
(
SELECT  id
FROM    table2
)

count
999000
1 row fetched in 0.0001s (0.5625s)
Aggregate  (cost=21017.25..21017.26 rows=1 width=0)
  ->  Seq Scan on table1  (cost=19.50..19767.91 rows=499736 width=0)
        Filter: (NOT (hashed subplan))
        SubPlan
          ->  Seq Scan on table2  (cost=0.00..17.00 rows=1000 width=4)

Let's try the tricks we used on other systems.

First, let's try joining on DISTINCT:

SELECT  COUNT(*)
FROM    (
SELECT  DISTINCT grouper
FROM    table1
) d
JOIN    table1 t1
ON      t1.grouper = d.grouper
WHERE   d.grouper NOT IN
(
SELECT  id
FROM    table2
)

count
1000
1 row fetched in 0.0001s (2.1875s)
Aggregate  (cost=106988.95..106988.96 rows=1 width=0)
  ->  Hash Join  (cost=33667.64..105740.84 rows=499245 width=0)
        Hash Cond: (d.grouper = t1.grouper)
        ->  Subquery Scan d  (cost=19.50..60862.70 rows=508 width=4)
              Filter: (NOT (hashed subplan))
              ->  Unique  (cost=0.00..60830.49 rows=1017 width=4)
                    ->  Index Scan using ix_table1_grouper on table1  (cost=0.00..58331.80 rows=999473 width=4)
              SubPlan
                ->  Seq Scan on table2  (cost=0.00..17.00 rows=1000 width=4)
        ->  Hash  (cost=17249.73..17249.73 rows=999473 width=4)
              ->  Seq Scan on table1 t1  (cost=0.00..17249.73 rows=999473 width=4)

This is a huge performance degradation, the query runs four times as long.

There are two reasons for this:

  1. PostgreSQL's DISTINCT operation relies on SORT, and the engine traverses the index on grouper to obtain the sorted resultset. Index traversing is inefficient in PostgreSQL compared to table scan.
  2. HASH JOIN algorithm is used to join table1 and d. This requires building the hash table on d and scanning all the records from table1, which is less efficient compared to NESTED LOOPS.

We can rewrite this query a little and force PostgreSQL to use more efficient algorithms:

SELECT  COUNT(*)
FROM    (
SELECT  grouper
FROM    table1
GROUP BY
grouper
) d
JOIN    table1 t1
ON      t1.grouper <= d.grouper
        AND t1.grouper >= d.grouper
WHERE   d.grouper NOT IN
(
SELECT  id
FROM    table2
)

count
1000
1 row fetched in 0.0001s (0.4996s)
Aggregate  (cost=2178905.71..2178905.72 rows=1 width=0)
  ->  Nested Loop  (cost=19767.91..2037868.96 rows=56414698 width=0)
        ->  Subquery Scan d  (cost=19767.91..19790.79 rows=508 width=4)
              Filter: (NOT (hashed subplan))
              ->  HashAggregate  (cost=19748.41..19758.58 rows=1017 width=4)
                    ->  Seq Scan on table1  (cost=0.00..17249.73 rows=999473 width=4)
              SubPlan
                ->  Seq Scan on table2  (cost=0.00..17.00 rows=1000 width=4)
        ->  Index Scan using ix_table1_grouper on table1 t1  (cost=0.00..2306.80 rows=111053 width=4)
              Index Cond: ((t1.grouper <= d.grouper) AND (t1.grouper >= d.grouper))

This query uses more efficient HashAggregate to build DISTINCT grouper's and the inequality JOIN conditions make NESTED LOOP's the only usable join algorithm (HASH JOIN and MERGE JOIN both require an equality condition).

As a result, this query runs much faster compared to the DISTINCT query, and even a little more faster than the original NOT IN query.

Now, let's check reverse COUNT(*):

SELECT  (
SELECT  COUNT(*)
FROM    table1
) -
(
SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.id IN
(
SELECT  id
FROM    table2
)
)

?column?
999000
1 row fetched in 0.0001s (0.2302s)
Result  (cost=19824.35..19824.36 rows=1 width=0)
  InitPlan
    ->  Aggregate  (cost=19748.42..19748.42 rows=1 width=0)
          ->  Seq Scan on table1  (cost=0.00..17249.73 rows=999473 width=0)
    ->  Aggregate  (cost=75.91..75.92 rows=1 width=0)
          ->  Merge IN Join  (cost=25.31..73.41 rows=1000 width=0)
                Merge Cond: (t1.id = table2.id)
                ->  Index Scan using table1_pkey on table1 t1  (cost=0.00..33230.45 rows=999473 width=4)
                ->  Index Scan using table2_pkey on table2  (cost=0.00..45.25 rows=1000 width=4)

PostgreSQL uses MERGE IN JOIN, which is similar to the plan used by SQL Server. This implies index scanning on both tables, since MERGE algorithms require sorted resultsets.

MERGE IN does not care for the values not present in table2, that's why it can skip the non-existing values from table1 as soon as it finds out they're nonexistent.

Since all values present in table2 are found in the very beginning of table1, this algorithm is as efficient as NESTED LOOPS.

As with the other database systems, COUNT(*) reversal seems to improve performance.

Summary:


PostgreSQL uses NOT (HASHED SUBPLAN) to find the missing rows (both for selective and non-selective fields).

Using GROUP BY for narrowing dows the lookup resultset improves the query a little, though a hack is needed to enforce efficient NESTED LOOPS not chosen by optimizer.

As with other systems, using COUNT(*) reversal also improves the query performance when there are lots of rows satisfying the NOT IN condition.

Written by Quassnoi

April 22nd, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply