SQL Server: aggregate bitwise OR
From Stack Overflow:
I am creating a script for merging and deleting duplicate rows from a table.
The table contains address information, and uses an integer field for storing information about the email as bit flags (column name
value
). For example, if bit 1 is set invalue
, that means the record is a primary address.There are instances of the same email being entered twice, but sometimes with different
value
s. To resolve this, I need to take thevalue
from all duplicates, assign them to one surviving record and delete the rest.My biggest headache so far as been with the merging of the records. What I want to do is bitwise
OR
allvalue
s of duplicate records together.
From database theory's point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.
However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.
We have three problems here:
- Select a first record for each set of duplicates
- Update this record with bitwise
OR
of all values in its set - Delete all other records
Step 1 is easy to do using ROW_NUMBER()
.
Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.
See this article I wrote some time ago on how to do this:
Now, the main problem is step 2.
SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.
The main idea here is that for bit values, aggregate OR
and AND
can be replaced with MAX
and MIN
, accordingly.
All we need is to split each value into the bits, aggregate each bit and merge the results together.
Let's create a sample table:
CREATE SCHEMA [20090713_bitwise] CREATE TABLE t_address ( id INT NOT NULL PRIMARY KEY, address VARCHAR(50) NOT NULL, value INT NOT NULL ) GO BEGIN TRANSACTION DECLARE @cnt INT SET @cnt = 1 WHILE @cnt <= 20 BEGIN INSERT INTO [20090713_bitwise].t_address VALUES ( @cnt, 'address' + CAST((@cnt - 1) % 5 + 1 AS VARCHAR) + '@example.com', POWER(2, @cnt - 1) ) SET @cnt = @cnt + 1 END COMMIT GO
Here's what we have in this table:
SELECT * FROM [20090713_bitwise].t_address
id | address | value |
---|---|---|
1 | address1@example.com | 1 |
2 | address2@example.com | 2 |
3 | address3@example.com | 4 |
4 | address4@example.com | 8 |
5 | address5@example.com | 16 |
6 | address1@example.com | 32 |
7 | address2@example.com | 64 |
8 | address3@example.com | 128 |
9 | address4@example.com | 256 |
10 | address5@example.com | 512 |
11 | address1@example.com | 1024 |
12 | address2@example.com | 2048 |
13 | address3@example.com | 4096 |
14 | address4@example.com | 8192 |
15 | address5@example.com | 16384 |
16 | address1@example.com | 32768 |
17 | address2@example.com | 65536 |
18 | address3@example.com | 131072 |
19 | address4@example.com | 262144 |
20 | address5@example.com | 524288 |
We need to find a bitwise OR
aggregate for each address.
To do this, we need to generate a dummy resultset of values from 0 to 30. This will give us 31 records, one for each bit (possible for a signed integer).
For each bit record, we need to make a bitmask. SQL Server lacks bit shift operations, but they can be easily emulated using POWER(2)
.
Here's the bit resultset applied to record 1 from our table:
WITH bits AS ( SELECT 0 AS b UNION ALL SELECT b + 1 FROM bits WHERE b <= 29 ) SELECT b, a.*, POWER(2, b) AS bitmask, value & POWER(2, b) AS bitset FROM bits CROSS JOIN [20090713_bitwise].t_address a WHERE id = 10
b | id | address | value | bitmask | bitset |
---|---|---|---|---|---|
0 | 10 | address5@example.com | 512 | 1 | 0 |
1 | 10 | address5@example.com | 512 | 2 | 0 |
2 | 10 | address5@example.com | 512 | 4 | 0 |
3 | 10 | address5@example.com | 512 | 8 | 0 |
4 | 10 | address5@example.com | 512 | 16 | 0 |
5 | 10 | address5@example.com | 512 | 32 | 0 |
6 | 10 | address5@example.com | 512 | 64 | 0 |
7 | 10 | address5@example.com | 512 | 128 | 0 |
8 | 10 | address5@example.com | 512 | 256 | 0 |
9 | 10 | address5@example.com | 512 | 512 | 512 |
10 | 10 | address5@example.com | 512 | 1024 | 0 |
11 | 10 | address5@example.com | 512 | 2048 | 0 |
12 | 10 | address5@example.com | 512 | 4096 | 0 |
13 | 10 | address5@example.com | 512 | 8192 | 0 |
14 | 10 | address5@example.com | 512 | 16384 | 0 |
15 | 10 | address5@example.com | 512 | 32768 | 0 |
16 | 10 | address5@example.com | 512 | 65536 | 0 |
17 | 10 | address5@example.com | 512 | 131072 | 0 |
18 | 10 | address5@example.com | 512 | 262144 | 0 |
19 | 10 | address5@example.com | 512 | 524288 | 0 |
20 | 10 | address5@example.com | 512 | 1048576 | 0 |
21 | 10 | address5@example.com | 512 | 2097152 | 0 |
22 | 10 | address5@example.com | 512 | 4194304 | 0 |
23 | 10 | address5@example.com | 512 | 8388608 | 0 |
24 | 10 | address5@example.com | 512 | 16777216 | 0 |
25 | 10 | address5@example.com | 512 | 33554432 | 0 |
26 | 10 | address5@example.com | 512 | 67108864 | 0 |
27 | 10 | address5@example.com | 512 | 134217728 | 0 |
28 | 10 | address5@example.com | 512 | 268435456 | 0 |
29 | 10 | address5@example.com | 512 | 536870912 | 0 |
30 | 10 | address5@example.com | 512 | 1073741824 | 0 |
Now we should find a maximum of bitset
among all addresses for each bit. Since a given bit always produces either POWER(2, b)
in bitand if the bit is set in value
or a zero if it's not, the groupwise maximum of bitset
will be an aggregate OR
of all bits in this position over the group:
WITH bits AS ( SELECT 0 AS b UNION ALL SELECT b + 1 FROM bits WHERE b <= 29 ) SELECT a.id, MAX(value & POWER(2, b)) AS bitor FROM bits CROSS JOIN [20090713_bitwise].t_address a WHERE b = 4 AND address = 'address5@example.com' GROUP BY id WITH ROLLUP
id | bitor |
---|---|
5 | 16 |
10 | 0 |
15 | 0 |
20 | 0 |
16 |
Here are the records corresponding to bit 4. It is set for the record 5 and not set for other records, so the aggregate OR
will set in the result.
Then we need to SUM
the maximums of all bitset
s we got for each group:
WITH bits AS ( SELECT 0 AS b UNION ALL SELECT b + 1 FROM bits WHERE b <= 29 ) SELECT a.*, ( SELECT SUM(bitor) FROM ( SELECT b, MAX(value & POWER(2, b)) AS bitor FROM [20090713_bitwise].t_address ai CROSS JOIN bits WHERE ai.address = a.address GROUP BY b ) bo ) AS bitor_agg FROM [20090713_bitwise].t_address a
id | address | value | bitor_agg |
---|---|---|---|
1 | address1@example.com | 1 | 33825 |
2 | address2@example.com | 2 | 67650 |
3 | address3@example.com | 4 | 135300 |
4 | address4@example.com | 8 | 270600 |
5 | address5@example.com | 16 | 541200 |
6 | address1@example.com | 32 | 33825 |
7 | address2@example.com | 64 | 67650 |
8 | address3@example.com | 128 | 135300 |
9 | address4@example.com | 256 | 270600 |
10 | address5@example.com | 512 | 541200 |
11 | address1@example.com | 1024 | 33825 |
12 | address2@example.com | 2048 | 67650 |
13 | address3@example.com | 4096 | 135300 |
14 | address4@example.com | 8192 | 270600 |
15 | address5@example.com | 16384 | 541200 |
16 | address1@example.com | 32768 | 33825 |
17 | address2@example.com | 65536 | 67650 |
18 | address3@example.com | 131072 | 135300 |
19 | address4@example.com | 262144 | 270600 |
20 | address5@example.com | 524288 | 541200 |
This query returns aggregate OR
for all values in the group. For instance, 33825 is 0x8421 in hex, which is an OR
combination of individual bits 0, 5, 10 and 15 set (bits offsets are 0-based).
We now need to UPDATE
the value
of the first record in each group with the new value
. This is easily done using ROW_NUMBER()
:
WITH bits AS ( SELECT 0 AS b UNION ALL SELECT b + 1 FROM bits WHERE b <= 29 ), v AS ( SELECT a.*, ( SELECT SUM(bitor) FROM ( SELECT b, MAX(value & POWER(2, b)) AS bitor FROM [20090713_bitwise].t_address ai CROSS JOIN bits WHERE ai.address = a.address GROUP BY b ) bo ) AS bitor_agg FROM ( SELECT ai.*, ROW_NUMBER() OVER (PARTITION BY address ORDER BY id) AS rn FROM [20090713_bitwise].t_address ai ) a WHERE rn = 1 ) UPDATE v SET value = bitor_agg
Here's what we get after this UPDATE
:
SELECT * FROM [20090713_bitwise].t_address
id | address | value |
---|---|---|
1 | address1@example.com | 33825 |
2 | address2@example.com | 67650 |
3 | address3@example.com | 135300 |
4 | address4@example.com | 270600 |
5 | address5@example.com | 541200 |
6 | address1@example.com | 32 |
7 | address2@example.com | 64 |
8 | address3@example.com | 128 |
9 | address4@example.com | 256 |
10 | address5@example.com | 512 |
11 | address1@example.com | 1024 |
12 | address2@example.com | 2048 |
13 | address3@example.com | 4096 |
14 | address4@example.com | 8192 |
15 | address5@example.com | 16384 |
16 | address1@example.com | 32768 |
17 | address2@example.com | 65536 |
18 | address3@example.com | 131072 |
19 | address4@example.com | 262144 |
20 | address5@example.com | 524288 |
Finally, we just need to get rid of duplicates. Again, using CTEs and ROW_NUMBER()
:
WITH v AS ( SELECT ai.*, ROW_NUMBER() OVER (PARTITION BY address ORDER BY id) AS rn FROM [20090713_bitwise].t_address ai ) DELETE FROM v WHERE rn > 1
And the final resultset:
SELECT * FROM [20090713_bitwise].t_address
id | address | value |
---|---|---|
1 | address1@example.com | 33825 |
2 | address2@example.com | 67650 |
3 | address3@example.com | 135300 |
4 | address4@example.com | 270600 |
5 | address5@example.com | 541200 |
That is exactly what we need.
Didn’t read everything, but wanted to mention that for just getting the bitwise-or sum one can use
SELECT SUM(DISTINCT bitwise-value)
…
Ayalisha
5 Mar 13 at 14:06
@Ayalisha: what do I get if I sum 0x01 and 0x03?
Quassnoi
5 Mar 13 at 15:09