Archive for July 13th, 2009
From Stack Overflow:
I am creating a script for merging and deleting duplicate rows from a table.
The table contains address information, and uses an integer field for storing information about the email as bit flags (column name
value). For example, if bit 1 is set in
value, that means the record is a primary address.
There are instances of the same email being entered twice, but sometimes with different
values. To resolve this, I need to take the
valuefrom all duplicates, assign them to one surviving record and delete the rest.
My biggest headache so far as been with the merging of the records. What I want to do is bitwise
values of duplicate records together.
From database theory’s point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.
However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.
We have three problems here:
- Select a first record for each set of duplicates
- Update this record with bitwise
ORof all values in its set
- Delete all other records
Step 1 is easy to do using
Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.
See this article I wrote some time ago on how to do this:
Now, the main problem is step 2.
SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.
The main idea here is that for bit values, aggregate
AND can be replaced with
All we need is to split each value into the bits, aggregate each bit and merge the results together.
Let’s create a sample table:
Read the rest of this entry »