EXPLAIN EXTENDED

How to create fast database queries

SQL Server: aggregate bitwise OR

with 2 comments

From Stack Overflow:

I am creating a script for merging and deleting duplicate rows from a table.

The table contains address information, and uses an integer field for storing information about the email as bit flags (column name value). For example, if bit 1 is set in value, that means the record is a primary address.

There are instances of the same email being entered twice, but sometimes with different values. To resolve this, I need to take the value from all duplicates, assign them to one surviving record and delete the rest.

My biggest headache so far as been with the merging of the records. What I want to do is bitwise OR all values of duplicate records together.

From database theory's point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.

However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.

We have three problems here:

  1. Select a first record for each set of duplicates
  2. Update this record with bitwise OR of all values in its set
  3. Delete all other records

Step 1 is easy to do using ROW_NUMBER().

Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.

See this article I wrote some time ago on how to do this:

Now, the main problem is step 2.

SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.

The main idea here is that for bit values, aggregate OR and AND can be replaced with MAX and MIN, accordingly.

All we need is to split each value into the bits, aggregate each bit and merge the results together.

Let's create a sample table:

CREATE SCHEMA [20090713_bitwise]
CREATE TABLE t_address (
id INT NOT NULL PRIMARY KEY,
address VARCHAR(50) NOT NULL,
value INT NOT NULL
)
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 20
BEGIN
        INSERT
        INTO    [20090713_bitwise].t_address
        VALUES  (
                @cnt,
                'address' + CAST((@cnt - 1) % 5 + 1 AS VARCHAR) + '@example.com',
                POWER(2, @cnt - 1)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO

Here's what we have in this table:

SELECT  *
FROM    [20090713_bitwise].t_address
id address value
1 address1@example.com 1
2 address2@example.com 2
3 address3@example.com 4
4 address4@example.com 8
5 address5@example.com 16
6 address1@example.com 32
7 address2@example.com 64
8 address3@example.com 128
9 address4@example.com 256
10 address5@example.com 512
11 address1@example.com 1024
12 address2@example.com 2048
13 address3@example.com 4096
14 address4@example.com 8192
15 address5@example.com 16384
16 address1@example.com 32768
17 address2@example.com 65536
18 address3@example.com 131072
19 address4@example.com 262144
20 address5@example.com 524288

We need to find a bitwise OR aggregate for each address.

To do this, we need to generate a dummy resultset of values from 0 to 30. This will give us 31 records, one for each bit (possible for a signed integer).

For each bit record, we need to make a bitmask. SQL Server lacks bit shift operations, but they can be easily emulated using POWER(2).

Here's the bit resultset applied to record 1 from our table:

WITH    bits AS
(
SELECT  0 AS b
UNION ALL
SELECT  b + 1
FROM    bits
WHERE   b <= 29
        )
SELECT  b, a.*, POWER(2, b) AS bitmask, value & POWER(2, b) AS bitset
FROM    bits
CROSS JOIN
        [20090713_bitwise].t_address a
WHERE   id = 10

b id address value bitmask bitset
0 10 address5@example.com 512 1 0
1 10 address5@example.com 512 2 0
2 10 address5@example.com 512 4 0
3 10 address5@example.com 512 8 0
4 10 address5@example.com 512 16 0
5 10 address5@example.com 512 32 0
6 10 address5@example.com 512 64 0
7 10 address5@example.com 512 128 0
8 10 address5@example.com 512 256 0
9 10 address5@example.com 512 512 512
10 10 address5@example.com 512 1024 0
11 10 address5@example.com 512 2048 0
12 10 address5@example.com 512 4096 0
13 10 address5@example.com 512 8192 0
14 10 address5@example.com 512 16384 0
15 10 address5@example.com 512 32768 0
16 10 address5@example.com 512 65536 0
17 10 address5@example.com 512 131072 0
18 10 address5@example.com 512 262144 0
19 10 address5@example.com 512 524288 0
20 10 address5@example.com 512 1048576 0
21 10 address5@example.com 512 2097152 0
22 10 address5@example.com 512 4194304 0
23 10 address5@example.com 512 8388608 0
24 10 address5@example.com 512 16777216 0
25 10 address5@example.com 512 33554432 0
26 10 address5@example.com 512 67108864 0
27 10 address5@example.com 512 134217728 0
28 10 address5@example.com 512 268435456 0
29 10 address5@example.com 512 536870912 0
30 10 address5@example.com 512 1073741824 0

Now we should find a maximum of bitset among all addresses for each bit. Since a given bit always produces either POWER(2, b) in bitand if the bit is set in value or a zero if it's not, the groupwise maximum of bitset will be an aggregate OR of all bits in this position over the group:

WITH    bits AS
(
SELECT  0 AS b
UNION ALL
SELECT  b + 1
FROM    bits
WHERE   b <= 29
        )
SELECT  a.id, MAX(value & POWER(2, b)) AS bitor
FROM    bits
CROSS JOIN
        [20090713_bitwise].t_address a
WHERE   b = 4
        AND address = 'address5@example.com'
GROUP BY
        id WITH ROLLUP
id bitor
5 16
10 0
15 0
20 0
16

Here are the records corresponding to bit 4. It is set for the record 5 and not set for other records, so the aggregate OR will set in the result.

Then we need to SUM the maximums of all bitsets we got for each group:

WITH    bits AS
(
SELECT  0 AS b
UNION ALL
SELECT  b + 1
FROM    bits
WHERE   b <= 29
        )
SELECT  a.*,
        (
        SELECT  SUM(bitor)
        FROM    (
                SELECT  b, MAX(value & POWER(2, b)) AS bitor
                FROM    [20090713_bitwise].t_address ai
                CROSS JOIN
                        bits
                WHERE   ai.address = a.address
                GROUP BY
                        b
                ) bo
        ) AS bitor_agg
FROM    [20090713_bitwise].t_address a
id address value bitor_agg
1 address1@example.com 1 33825
2 address2@example.com 2 67650
3 address3@example.com 4 135300
4 address4@example.com 8 270600
5 address5@example.com 16 541200
6 address1@example.com 32 33825
7 address2@example.com 64 67650
8 address3@example.com 128 135300
9 address4@example.com 256 270600
10 address5@example.com 512 541200
11 address1@example.com 1024 33825
12 address2@example.com 2048 67650
13 address3@example.com 4096 135300
14 address4@example.com 8192 270600
15 address5@example.com 16384 541200
16 address1@example.com 32768 33825
17 address2@example.com 65536 67650
18 address3@example.com 131072 135300
19 address4@example.com 262144 270600
20 address5@example.com 524288 541200

This query returns aggregate OR for all values in the group. For instance, 33825 is 0x8421 in hex, which is an OR combination of individual bits 0, 5, 10 and 15 set (bits offsets are 0-based).

We now need to UPDATE the value of the first record in each group with the new value. This is easily done using ROW_NUMBER():

WITH    bits AS
(
SELECT  0 AS b
UNION ALL
SELECT  b + 1
FROM    bits
WHERE   b <= 29
        ),
        v AS
        (
        SELECT  a.*,
                (
                SELECT  SUM(bitor)
                FROM    (
                        SELECT  b, MAX(value & POWER(2, b)) AS bitor
                        FROM    [20090713_bitwise].t_address ai
                        CROSS JOIN
                                bits
                        WHERE   ai.address = a.address
                        GROUP BY
                                b
                        ) bo
                ) AS bitor_agg
        FROM    (
                SELECT  ai.*, ROW_NUMBER() OVER (PARTITION BY address ORDER BY id) AS rn
                FROM    [20090713_bitwise].t_address ai
                ) a
        WHERE   rn = 1        
        )
UPDATE  v
SET     value = bitor_agg

Here's what we get after this UPDATE:

SELECT  *
FROM    [20090713_bitwise].t_address
id address value
1 address1@example.com 33825
2 address2@example.com 67650
3 address3@example.com 135300
4 address4@example.com 270600
5 address5@example.com 541200
6 address1@example.com 32
7 address2@example.com 64
8 address3@example.com 128
9 address4@example.com 256
10 address5@example.com 512
11 address1@example.com 1024
12 address2@example.com 2048
13 address3@example.com 4096
14 address4@example.com 8192
15 address5@example.com 16384
16 address1@example.com 32768
17 address2@example.com 65536
18 address3@example.com 131072
19 address4@example.com 262144
20 address5@example.com 524288

Finally, we just need to get rid of duplicates. Again, using CTEs and ROW_NUMBER():

WITH    v AS
(
SELECT  ai.*, ROW_NUMBER() OVER (PARTITION BY address ORDER BY id) AS rn
FROM    [20090713_bitwise].t_address ai
)
DELETE
FROM    v
WHERE   rn > 1

And the final resultset:

SELECT  *
FROM    [20090713_bitwise].t_address
id address value
1 address1@example.com 33825
2 address2@example.com 67650
3 address3@example.com 135300
4 address4@example.com 270600
5 address5@example.com 541200

That is exactly what we need.

Written by Quassnoi

July 13th, 2009 at 11:00 pm

Posted in SQL Server

2 Responses to 'SQL Server: aggregate bitwise OR'

Subscribe to comments with RSS

  1. Didn’t read everything, but wanted to mention that for just getting the bitwise-or sum one can use

    SELECT SUM(DISTINCT bitwise-value)

    Ayalisha

    5 Mar 13 at 14:06

  2. @Ayalisha: what do I get if I sum 0x01 and 0x03?

    Quassnoi

    5 Mar 13 at 15:09

Leave a Reply