PostgreSQL: making duplicate records unique
Comments enabled. I *really* need your comment
From Stack Overflow:
In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.
For integration with another system, I need to add a column to the table that makes these duplicate rows unique.
The table is essentially:
txn# detail# amount I could just use an incrementing
seq#
for every existing row, however, the application changes will be to take the existingalmost key(basically(txn#, detail#)
), and increment the sequence number for this combinationSo, e. g., I'll have two rows for
(txn#, detail#) = (1, 1)
, and haveseq#
1 for the first and 2 for the second.If
(txn#, detail#) = (513, 44)
has 3 duplicate rows, these would haveseq#
1, 2 and 3 appropriately.
We just need to update the table with a condition that would distinguish between the rows somehow.
Fortunatly, every PostgreSQL table has an implicit primary key, ctid
.
In fact, it's the pointer to the place in the datafile where the row data resides.
In new PostgreSQL 8.4, we can employ a window function to assign a seq
to each row.
This script fills the table with duplicate values (two duplicates for each (txn, detail)
pair):
CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT); CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail); INSERT INTO t_dup (txn, detail, seq) SELECT t, d, NULL FROM generate_series(1, 100) t, generate_series(1, 1000) d, generate_series (1, 2) s; ANALYZE t_dup;
Here's the query to update it:
UPDATE ONLY t_dup SET seq = rn FROM ( SELECT ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn FROM t_dup ) d WHERE t_dup.ctid = d.ctid
This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.
For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*)
.
Fortunately, ctid
s are comparable, and we can use less than
(<
) operator on them to build an ordered set.
This is less efficient, but still works (if the keys are almost unique
, i. e. there are not many duplicates of each key):
UPDATE ONLY t_dup d SET seq = ( SELECT COUNT(*) FROM t_dup di WHERE di.txn = d.txn AND di.detail = d.detail AND di.ctid <= d.ctid );
This takes a little bit longer (15 seconds), however still works.