Archive for July 30th, 2009
PostgreSQL: making duplicate records unique
Comments enabled. I *really* need your comment
From Stack Overflow:
In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.
For integration with another system, I need to add a column to the table that makes these duplicate rows unique.
The table is essentially:
txn# detail# amount I could just use an incrementing
seq#for every existing row, however, the application changes will be to take the existingalmost key(basically(txn#, detail#)), and increment the sequence number for this combinationSo, e. g., I'll have two rows for
(txn#, detail#) = (1, 1), and haveseq#1 for the first and 2 for the second.If
(txn#, detail#) = (513, 44)has 3 duplicate rows, these would haveseq#1, 2 and 3 appropriately.
We just need to update the table with a condition that would distinguish between the rows somehow.
Fortunatly, every PostgreSQL table has an implicit primary key, ctid.
In fact, it's the pointer to the place in the datafile where the row data resides.
In new PostgreSQL 8.4, we can employ a window function to assign a seq to each row.
This script fills the table with duplicate values (two duplicates for each (txn, detail) pair):
CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT); CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail); INSERT INTO t_dup (txn, detail, seq) SELECT t, d, NULL FROM generate_series(1, 100) t, generate_series(1, 1000) d, generate_series (1, 2) s; ANALYZE t_dup;
Here's the query to update it:
UPDATE ONLY t_dup SET seq = rn FROM ( SELECT ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn FROM t_dup ) d WHERE t_dup.ctid = d.ctid
This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.
For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*).
Fortunately, ctids are comparable, and we can use less than
(<) operator on them to build an ordered set.
This is less efficient, but still works (if the keys are almost unique
, i. e. there are not many duplicates of each key):
UPDATE ONLY t_dup d SET seq = ( SELECT COUNT(*) FROM t_dup di WHERE di.txn = d.txn AND di.detail = d.detail AND di.ctid <= d.ctid );
This takes a little bit longer (15 seconds), however still works.
Subscribe in a reader