EXPLAIN EXTENDED

How to create fast database queries

Archive for July 30th, 2009

PostgreSQL: making duplicate records unique

Comments enabled. I *really* need your comment

From Stack Overflow:

In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.

For integration with another system, I need to add a column to the table that makes these duplicate rows unique.

The table is essentially:

txn# detail# amount

I could just use an incrementing seq# for every existing row, however, the application changes will be to take the existing almost key (basically (txn#, detail#)), and increment the sequence number for this combination

So, e. g., I'll have two rows for (txn#, detail#) = (1, 1), and have seq# 1 for the first and 2 for the second.

If (txn#, detail#) = (513, 44) has 3 duplicate rows, these would have seq# 1, 2 and 3 appropriately.

We just need to update the table with a condition that would distinguish between the rows somehow.

Fortunatly, every PostgreSQL table has an implicit primary key, ctid.

In fact, it's the pointer to the place in the datafile where the row data resides.

In new PostgreSQL 8.4, we can employ a window function to assign a seq to each row.

This script fills the table with duplicate values (two duplicates for each (txn, detail) pair):

CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT);

CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail);

INSERT
INTO    t_dup (txn, detail, seq)
SELECT  t, d, NULL
FROM    generate_series(1, 100) t,
generate_series(1, 1000) d,
generate_series (1, 2) s;

ANALYZE t_dup;

Here's the query to update it:

UPDATE  ONLY t_dup
SET     seq = rn
FROM    (
SELECT  ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn
FROM    t_dup
) d
WHERE   t_dup.ctid = d.ctid

This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.

For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*).

Fortunately, ctids are comparable, and we can use less than (<) operator on them to build an ordered set.

This is less efficient, but still works (if the keys are almost unique, i. e. there are not many duplicates of each key):

UPDATE	ONLY t_dup d
SET	seq =
(
SELECT	COUNT(*)
FROM	t_dup di
WHERE	di.txn = d.txn
AND di.detail = d.detail
AND di.ctid &lt;= d.ctid
	);

This takes a little bit longer (15 seconds), however still works.

Written by Quassnoi

July 30th, 2009 at 11:00 pm

Posted in PostgreSQL