Counting missing rows: PostgreSQL
Comments enabled. I *really* need your comment
This is the 5th of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:
NOT INin MySQLNOT INin MySQL (usingDISTINCT)NOT INin SQL ServerNOT INin OracleNOT INin PostgreSQL
Finally let's look how PostgreSQL copes with this predicate.
Let's create sample tables:
Table creation details
As in previous articles, table1 has 1,000,000 rows and table2 has 1,000 rows.
Now, let's check the main query:
SELECT COUNT(*) FROM table1 WHERE id NOT IN ( SELECT id FROM table2 )
| count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.5469s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This query runs for 546 ms and uses the following plan:
Aggregate (cost=21017.25..21017.26 rows=1 width=0)
-> Seq Scan on table1 (cost=19.50..19767.91 rows=499736 width=0)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4)
HASHED SUBPLAN is much alike HASH ANTI JOIN used by Oracle: it builds a hash table over the id's from table2 and searches the id's from table1 in this hash table.
Unlike Oracle, PostgreSQL chooses SEQ SCAN (which scans the table data pages) to retrieve the rowset, because PostgreSQL isn't capable of doing the INDEX FAST FULL SCAN.
If we search for missing grouper's, both query plan and execution time remain the same:
SELECT COUNT(*) FROM table1 WHERE id NOT IN ( SELECT id FROM table2 )
| count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.5625s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Aggregate (cost=21017.25..21017.26 rows=1 width=0)
-> Seq Scan on table1 (cost=19.50..19767.91 rows=499736 width=0)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4)
Let's try the tricks we used on other systems.
First, let's try joining on DISTINCT:
SELECT COUNT(*) FROM ( SELECT DISTINCT grouper FROM table1 ) d JOIN table1 t1 ON t1.grouper = d.grouper WHERE d.grouper NOT IN ( SELECT id FROM table2 )
| count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (2.1875s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Aggregate (cost=106988.95..106988.96 rows=1 width=0)
-> Hash Join (cost=33667.64..105740.84 rows=499245 width=0)
Hash Cond: (d.grouper = t1.grouper)
-> Subquery Scan d (cost=19.50..60862.70 rows=508 width=4)
Filter: (NOT (hashed subplan))
-> Unique (cost=0.00..60830.49 rows=1017 width=4)
-> Index Scan using ix_table1_grouper on table1 (cost=0.00..58331.80 rows=999473 width=4)
SubPlan
-> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4)
-> Hash (cost=17249.73..17249.73 rows=999473 width=4)
-> Seq Scan on table1 t1 (cost=0.00..17249.73 rows=999473 width=4)
This is a huge performance degradation, the query runs four times as long.
There are two reasons for this:
- PostgreSQL's
DISTINCToperation relies onSORT, and the engine traverses the index ongrouperto obtain the sorted resultset. Index traversing is inefficient in PostgreSQL compared to table scan. HASH JOINalgorithm is used to jointable1andd. This requires building the hash table ondand scanning all the records fromtable1, which is less efficient compared toNESTED LOOPS.
We can rewrite this query a little and force PostgreSQL to use more efficient algorithms:
SELECT COUNT(*)
FROM (
SELECT grouper
FROM table1
GROUP BY
grouper
) d
JOIN table1 t1
ON t1.grouper <= d.grouper
AND t1.grouper >= d.grouper
WHERE d.grouper NOT IN
(
SELECT id
FROM table2
)
| count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.4996s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Aggregate (cost=2178905.71..2178905.72 rows=1 width=0)
-> Nested Loop (cost=19767.91..2037868.96 rows=56414698 width=0)
-> Subquery Scan d (cost=19767.91..19790.79 rows=508 width=4)
Filter: (NOT (hashed subplan))
-> HashAggregate (cost=19748.41..19758.58 rows=1017 width=4)
-> Seq Scan on table1 (cost=0.00..17249.73 rows=999473 width=4)
SubPlan
-> Seq Scan on table2 (cost=0.00..17.00 rows=1000 width=4)
-> Index Scan using ix_table1_grouper on table1 t1 (cost=0.00..2306.80 rows=111053 width=4)
Index Cond: ((t1.grouper <= d.grouper) AND (t1.grouper >= d.grouper))
This query uses more efficient HashAggregate to build DISTINCT grouper's and the inequality JOIN conditions make NESTED LOOP's the only usable join algorithm (HASH JOIN and MERGE JOIN both require an equality condition).
As a result, this query runs much faster compared to the DISTINCT query, and even a little more faster than the original NOT IN query.
Now, let's check reverse COUNT(*):
SELECT ( SELECT COUNT(*) FROM table1 ) - ( SELECT COUNT(*) FROM table1 t1 WHERE t1.id IN ( SELECT id FROM table2 ) )
| ?column? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.2302s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Result (cost=19824.35..19824.36 rows=1 width=0)
InitPlan
-> Aggregate (cost=19748.42..19748.42 rows=1 width=0)
-> Seq Scan on table1 (cost=0.00..17249.73 rows=999473 width=0)
-> Aggregate (cost=75.91..75.92 rows=1 width=0)
-> Merge IN Join (cost=25.31..73.41 rows=1000 width=0)
Merge Cond: (t1.id = table2.id)
-> Index Scan using table1_pkey on table1 t1 (cost=0.00..33230.45 rows=999473 width=4)
-> Index Scan using table2_pkey on table2 (cost=0.00..45.25 rows=1000 width=4)
PostgreSQL uses MERGE IN JOIN, which is similar to the plan used by SQL Server. This implies index scanning on both tables, since MERGE algorithms require sorted resultsets.
MERGE IN does not care for the values not present in table2, that's why it can skip the non-existing values from table1 as soon as it finds out they're nonexistent.
Since all values present in table2 are found in the very beginning of table1, this algorithm is as efficient as NESTED LOOPS.
As with the other database systems, COUNT(*) reversal seems to improve performance.
Summary:
PostgreSQL uses NOT (HASHED SUBPLAN) to find the missing rows (both for selective and non-selective fields).
Using GROUP BY for narrowing dows the lookup resultset improves the query a little, though a hack is needed to enforce efficient NESTED LOOPS not chosen by optimizer.
As with other systems, using COUNT(*) reversal also improves the query performance when there are lots of rows satisfying the NOT IN condition.
Subscribe in a reader