Archive for May 3rd, 2009
PostgreSQL: optimizing DISTINCT
In PostgreSQL (as of 8.3, at least), performance of DISTINCT
clause in SELECT
list is quite poor.
Probably because DISTINCT
code in PostgreSQL is very, very old, it always acts in same dumb way: sorts the resultset and filters out the duplicate records.
GROUP BY
that can be used for the same purpose is more smart, as it employs more efficient HashAggregate
, but its performance is still poor for large dataset.
All major RDBMS
's, including MySQL, are able to jump over index keys to select DISTINCT
values from an indexed table. This is extremely fast if there are lots of records in a table but not so many DISTINCT
values.
This behavior can be emulated in PostgreSQL too.
Let's create a sample table:
Read the rest of this entry »