SQL Server: aggregate bitwise OR
From Stack Overflow:
I am creating a script for merging and deleting duplicate rows from a table.
The table contains address information, and uses an integer field for storing information about the email as bit flags (column name
value). For example, if bit 1 is set invalue, that means the record is a primary address.There are instances of the same email being entered twice, but sometimes with different
values. To resolve this, I need to take thevaluefrom all duplicates, assign them to one surviving record and delete the rest.My biggest headache so far as been with the merging of the records. What I want to do is bitwise
ORallvalues of duplicate records together.
From database theory's point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.
However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.
We have three problems here:
- Select a first record for each set of duplicates
- Update this record with bitwise
ORof all values in its set - Delete all other records
Step 1 is easy to do using ROW_NUMBER().
Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.
See this article I wrote some time ago on how to do this:
Now, the main problem is step 2.
SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.
The main idea here is that for bit values, aggregate OR and AND can be replaced with MAX and MIN, accordingly.
All we need is to split each value into the bits, aggregate each bit and merge the results together.
Let's create a sample table:
CREATE SCHEMA [20090713_bitwise]
CREATE TABLE t_address (
id INT NOT NULL PRIMARY KEY,
address VARCHAR(50) NOT NULL,
value INT NOT NULL
)
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 20
BEGIN
INSERT
INTO [20090713_bitwise].t_address
VALUES (
@cnt,
'address' + CAST((@cnt - 1) % 5 + 1 AS VARCHAR) + '@example.com',
POWER(2, @cnt - 1)
)
SET @cnt = @cnt + 1
END
COMMIT
GO
Here's what we have in this table:
SELECT * FROM [20090713_bitwise].t_address
| id | address | value |
|---|---|---|
| 1 | address1@example.com | 1 |
| 2 | address2@example.com | 2 |
| 3 | address3@example.com | 4 |
| 4 | address4@example.com | 8 |
| 5 | address5@example.com | 16 |
| 6 | address1@example.com | 32 |
| 7 | address2@example.com | 64 |
| 8 | address3@example.com | 128 |
| 9 | address4@example.com | 256 |
| 10 | address5@example.com | 512 |
| 11 | address1@example.com | 1024 |
| 12 | address2@example.com | 2048 |
| 13 | address3@example.com | 4096 |
| 14 | address4@example.com | 8192 |
| 15 | address5@example.com | 16384 |
| 16 | address1@example.com | 32768 |
| 17 | address2@example.com | 65536 |
| 18 | address3@example.com | 131072 |
| 19 | address4@example.com | 262144 |
| 20 | address5@example.com | 524288 |
We need to find a bitwise OR aggregate for each address.
To do this, we need to generate a dummy resultset of values from 0 to 30. This will give us 31 records, one for each bit (possible for a signed integer).
For each bit record, we need to make a bitmask. SQL Server lacks bit shift operations, but they can be easily emulated using POWER(2).
Here's the bit resultset applied to record 1 from our table:
WITH bits AS
(
SELECT 0 AS b
UNION ALL
SELECT b + 1
FROM bits
WHERE b <= 29
)
SELECT b, a.*, POWER(2, b) AS bitmask, value & POWER(2, b) AS bitset
FROM bits
CROSS JOIN
[20090713_bitwise].t_address a
WHERE id = 10
| b | id | address | value | bitmask | bitset |
|---|---|---|---|---|---|
| 0 | 10 | address5@example.com | 512 | 1 | 0 |
| 1 | 10 | address5@example.com | 512 | 2 | 0 |
| 2 | 10 | address5@example.com | 512 | 4 | 0 |
| 3 | 10 | address5@example.com | 512 | 8 | 0 |
| 4 | 10 | address5@example.com | 512 | 16 | 0 |
| 5 | 10 | address5@example.com | 512 | 32 | 0 |
| 6 | 10 | address5@example.com | 512 | 64 | 0 |
| 7 | 10 | address5@example.com | 512 | 128 | 0 |
| 8 | 10 | address5@example.com | 512 | 256 | 0 |
| 9 | 10 | address5@example.com | 512 | 512 | 512 |
| 10 | 10 | address5@example.com | 512 | 1024 | 0 |
| 11 | 10 | address5@example.com | 512 | 2048 | 0 |
| 12 | 10 | address5@example.com | 512 | 4096 | 0 |
| 13 | 10 | address5@example.com | 512 | 8192 | 0 |
| 14 | 10 | address5@example.com | 512 | 16384 | 0 |
| 15 | 10 | address5@example.com | 512 | 32768 | 0 |
| 16 | 10 | address5@example.com | 512 | 65536 | 0 |
| 17 | 10 | address5@example.com | 512 | 131072 | 0 |
| 18 | 10 | address5@example.com | 512 | 262144 | 0 |
| 19 | 10 | address5@example.com | 512 | 524288 | 0 |
| 20 | 10 | address5@example.com | 512 | 1048576 | 0 |
| 21 | 10 | address5@example.com | 512 | 2097152 | 0 |
| 22 | 10 | address5@example.com | 512 | 4194304 | 0 |
| 23 | 10 | address5@example.com | 512 | 8388608 | 0 |
| 24 | 10 | address5@example.com | 512 | 16777216 | 0 |
| 25 | 10 | address5@example.com | 512 | 33554432 | 0 |
| 26 | 10 | address5@example.com | 512 | 67108864 | 0 |
| 27 | 10 | address5@example.com | 512 | 134217728 | 0 |
| 28 | 10 | address5@example.com | 512 | 268435456 | 0 |
| 29 | 10 | address5@example.com | 512 | 536870912 | 0 |
| 30 | 10 | address5@example.com | 512 | 1073741824 | 0 |
Now we should find a maximum of bitset among all addresses for each bit. Since a given bit always produces either POWER(2, b) in bitand if the bit is set in value or a zero if it's not, the groupwise maximum of bitset will be an aggregate OR of all bits in this position over the group:
WITH bits AS
(
SELECT 0 AS b
UNION ALL
SELECT b + 1
FROM bits
WHERE b <= 29
)
SELECT a.id, MAX(value & POWER(2, b)) AS bitor
FROM bits
CROSS JOIN
[20090713_bitwise].t_address a
WHERE b = 4
AND address = 'address5@example.com'
GROUP BY
id WITH ROLLUP
| id | bitor |
|---|---|
| 5 | 16 |
| 10 | 0 |
| 15 | 0 |
| 20 | 0 |
| 16 |
Here are the records corresponding to bit 4. It is set for the record 5 and not set for other records, so the aggregate OR will set in the result.
Then we need to SUM the maximums of all bitsets we got for each group:
WITH bits AS
(
SELECT 0 AS b
UNION ALL
SELECT b + 1
FROM bits
WHERE b <= 29
)
SELECT a.*,
(
SELECT SUM(bitor)
FROM (
SELECT b, MAX(value & POWER(2, b)) AS bitor
FROM [20090713_bitwise].t_address ai
CROSS JOIN
bits
WHERE ai.address = a.address
GROUP BY
b
) bo
) AS bitor_agg
FROM [20090713_bitwise].t_address a
| id | address | value | bitor_agg |
|---|---|---|---|
| 1 | address1@example.com | 1 | 33825 |
| 2 | address2@example.com | 2 | 67650 |
| 3 | address3@example.com | 4 | 135300 |
| 4 | address4@example.com | 8 | 270600 |
| 5 | address5@example.com | 16 | 541200 |
| 6 | address1@example.com | 32 | 33825 |
| 7 | address2@example.com | 64 | 67650 |
| 8 | address3@example.com | 128 | 135300 |
| 9 | address4@example.com | 256 | 270600 |
| 10 | address5@example.com | 512 | 541200 |
| 11 | address1@example.com | 1024 | 33825 |
| 12 | address2@example.com | 2048 | 67650 |
| 13 | address3@example.com | 4096 | 135300 |
| 14 | address4@example.com | 8192 | 270600 |
| 15 | address5@example.com | 16384 | 541200 |
| 16 | address1@example.com | 32768 | 33825 |
| 17 | address2@example.com | 65536 | 67650 |
| 18 | address3@example.com | 131072 | 135300 |
| 19 | address4@example.com | 262144 | 270600 |
| 20 | address5@example.com | 524288 | 541200 |
This query returns aggregate OR for all values in the group. For instance, 33825 is 0x8421 in hex, which is an OR combination of individual bits 0, 5, 10 and 15 set (bits offsets are 0-based).
We now need to UPDATE the value of the first record in each group with the new value. This is easily done using ROW_NUMBER():
WITH bits AS
(
SELECT 0 AS b
UNION ALL
SELECT b + 1
FROM bits
WHERE b <= 29
),
v AS
(
SELECT a.*,
(
SELECT SUM(bitor)
FROM (
SELECT b, MAX(value & POWER(2, b)) AS bitor
FROM [20090713_bitwise].t_address ai
CROSS JOIN
bits
WHERE ai.address = a.address
GROUP BY
b
) bo
) AS bitor_agg
FROM (
SELECT ai.*, ROW_NUMBER() OVER (PARTITION BY address ORDER BY id) AS rn
FROM [20090713_bitwise].t_address ai
) a
WHERE rn = 1
)
UPDATE v
SET value = bitor_agg
Here's what we get after this UPDATE:
SELECT * FROM [20090713_bitwise].t_address
| id | address | value |
|---|---|---|
| 1 | address1@example.com | 33825 |
| 2 | address2@example.com | 67650 |
| 3 | address3@example.com | 135300 |
| 4 | address4@example.com | 270600 |
| 5 | address5@example.com | 541200 |
| 6 | address1@example.com | 32 |
| 7 | address2@example.com | 64 |
| 8 | address3@example.com | 128 |
| 9 | address4@example.com | 256 |
| 10 | address5@example.com | 512 |
| 11 | address1@example.com | 1024 |
| 12 | address2@example.com | 2048 |
| 13 | address3@example.com | 4096 |
| 14 | address4@example.com | 8192 |
| 15 | address5@example.com | 16384 |
| 16 | address1@example.com | 32768 |
| 17 | address2@example.com | 65536 |
| 18 | address3@example.com | 131072 |
| 19 | address4@example.com | 262144 |
| 20 | address5@example.com | 524288 |
Finally, we just need to get rid of duplicates. Again, using CTEs and ROW_NUMBER():
WITH v AS ( SELECT ai.*, ROW_NUMBER() OVER (PARTITION BY address ORDER BY id) AS rn FROM [20090713_bitwise].t_address ai ) DELETE FROM v WHERE rn > 1
And the final resultset:
SELECT * FROM [20090713_bitwise].t_address
| id | address | value |
|---|---|---|
| 1 | address1@example.com | 33825 |
| 2 | address2@example.com | 67650 |
| 3 | address3@example.com | 135300 |
| 4 | address4@example.com | 270600 |
| 5 | address5@example.com | 541200 |
That is exactly what we need.
Subscribe in a reader
Didn’t read everything, but wanted to mention that for just getting the bitwise-or sum one can use
SELECT SUM(DISTINCT bitwise-value)
…
Ayalisha
5 Mar 13 at 14:06
@Ayalisha: what do I get if I sum 0x01 and 0x03?
Quassnoi
5 Mar 13 at 15:09