Counting missing rows: PostgreSQL
Comments enabled. I *really* need your comment
This is the 5th of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Finally let's look how PostgreSQL
copes with this predicate.
Let's create sample tables:
Table creation details
As in previous articles, table1
has 1,000,000 rows and table2
has 1,000 rows.
Now, let's check the main query:
SELECT COUNT(*) FROM table1 WHERE id NOT IN ( SELECT id FROM table2 )
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.5469s) |
This query runs for 546 ms and uses the following plan:
Aggregate (cost=21017.25..21017.26 rows=1 width=0) -> Seq Scan on table1 (cost=19.50..19767.91 rows=499736 width=0) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4)
HASHED SUBPLAN
is much alike HASH ANTI JOIN
used by Oracle: it builds a hash table over the id
's from table2
and searches the id
's from table1
in this hash table.
Unlike Oracle, PostgreSQL chooses SEQ SCAN
(which scans the table data pages) to retrieve the rowset, because PostgreSQL isn't capable of doing the INDEX FAST FULL SCAN
.
If we search for missing grouper
's, both query plan and execution time remain the same:
SELECT COUNT(*) FROM table1 WHERE id NOT IN ( SELECT id FROM table2 )
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.5625s) |
Aggregate (cost=21017.25..21017.26 rows=1 width=0) -> Seq Scan on table1 (cost=19.50..19767.91 rows=499736 width=0) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4)
Let's try the tricks we used on other systems.
First, let's try joining on DISTINCT
:
SELECT COUNT(*) FROM ( SELECT DISTINCT grouper FROM table1 ) d JOIN table1 t1 ON t1.grouper = d.grouper WHERE d.grouper NOT IN ( SELECT id FROM table2 )
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (2.1875s) |
Aggregate (cost=106988.95..106988.96 rows=1 width=0) -> Hash Join (cost=33667.64..105740.84 rows=499245 width=0) Hash Cond: (d.grouper = t1.grouper) -> Subquery Scan d (cost=19.50..60862.70 rows=508 width=4) Filter: (NOT (hashed subplan)) -> Unique (cost=0.00..60830.49 rows=1017 width=4) -> Index Scan using ix_table1_grouper on table1 (cost=0.00..58331.80 rows=999473 width=4) SubPlan -> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4) -> Hash (cost=17249.73..17249.73 rows=999473 width=4) -> Seq Scan on table1 t1 (cost=0.00..17249.73 rows=999473 width=4)
This is a huge performance degradation, the query runs four times as long.
There are two reasons for this:
- PostgreSQL's
DISTINCT
operation relies onSORT
, and the engine traverses the index ongrouper
to obtain the sorted resultset. Index traversing is inefficient in PostgreSQL compared to table scan. HASH JOIN
algorithm is used to jointable1
andd
. This requires building the hash table ond
and scanning all the records fromtable1
, which is less efficient compared toNESTED LOOPS
.
We can rewrite this query a little and force PostgreSQL to use more efficient algorithms:
SELECT COUNT(*) FROM ( SELECT grouper FROM table1 GROUP BY grouper ) d JOIN table1 t1 ON t1.grouper <= d.grouper AND t1.grouper >= d.grouper WHERE d.grouper NOT IN ( SELECT id FROM table2 )
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.4996s) |
Aggregate (cost=2178905.71..2178905.72 rows=1 width=0) -> Nested Loop (cost=19767.91..2037868.96 rows=56414698 width=0) -> Subquery Scan d (cost=19767.91..19790.79 rows=508 width=4) Filter: (NOT (hashed subplan)) -> HashAggregate (cost=19748.41..19758.58 rows=1017 width=4) -> Seq Scan on table1 (cost=0.00..17249.73 rows=999473 width=4) SubPlan -> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4) -> Index Scan using ix_table1_grouper on table1 t1 (cost=0.00..2306.80 rows=111053 width=4) Index Cond: ((t1.grouper <= d.grouper) AND (t1.grouper >= d.grouper))
This query uses more efficient HashAggregate
to build DISTINCT grouper
's and the inequality JOIN
conditions make NESTED LOOP
's the only usable join algorithm (HASH JOIN
and MERGE JOIN
both require an equality condition).
As a result, this query runs much faster compared to the DISTINCT
query, and even a little more faster than the original NOT IN
query.
Now, let's check reverse COUNT(*):
SELECT ( SELECT COUNT(*) FROM table1 ) - ( SELECT COUNT(*) FROM table1 t1 WHERE t1.id IN ( SELECT id FROM table2 ) )
?column? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.2302s) |
Result (cost=19824.35..19824.36 rows=1 width=0) InitPlan -> Aggregate (cost=19748.42..19748.42 rows=1 width=0) -> Seq Scan on table1 (cost=0.00..17249.73 rows=999473 width=0) -> Aggregate (cost=75.91..75.92 rows=1 width=0) -> Merge IN Join (cost=25.31..73.41 rows=1000 width=0) Merge Cond: (t1.id = table2.id) -> Index Scan using table1_pkey on table1 t1 (cost=0.00..33230.45 rows=999473 width=4) -> Index Scan using table2_pkey on table2 (cost=0.00..45.25 rows=1000 width=4)
PostgreSQL uses MERGE IN JOIN
, which is similar to the plan used by SQL Server. This implies index scanning on both tables, since MERGE
algorithms require sorted resultsets.
MERGE IN
does not care for the values not present in table2
, that's why it can skip the non-existing values from table1
as soon as it finds out they're nonexistent.
Since all values present in table2
are found in the very beginning of table1
, this algorithm is as efficient as NESTED LOOPS
.
As with the other database systems, COUNT(*)
reversal seems to improve performance.
Summary:
PostgreSQL uses NOT (HASHED SUBPLAN)
to find the missing rows (both for selective and non-selective fields).
Using GROUP BY
for narrowing dows the lookup resultset improves the query a little, though a hack is needed to enforce efficient NESTED LOOPS
not chosen by optimizer.
As with other systems, using COUNT(*)
reversal also improves the query performance when there are lots of rows satisfying the NOT IN
condition.