EXPLAIN EXTENDED

How to create fast database queries

Archive for August 4th, 2009

SQL Server: deleting adjacent values (improved)

Comments enabled. I *really* need your comment

Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:

A quick reminder of the problem, taken from Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.

In the previous article, I used a subquery to do this:

WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.day = qo.day
                                AND qi.interval < = qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    [20090803_adjacent].records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.day = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    [20090803_adjacent].records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0

This subquery traverses adjacent intervals until it find the first interval that has no predecessor.

Its id then serves as a range identifier in a PARTITION BY clause.

However, this query can be improved.

Assume we have the following intervals:

interval
3
4
5
6
10
11
12
20
21

We have three ranges here: 3:6, 10:12 and 20:21.

To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:

interval ROW_NUMBER
3 1
4 2
5 3
6 4
10 5
11 6
12 7
20 8
21 9

ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.

This means that we can just subtract ROW_NUMBER from the interval:

interval ROW_NUMBER ROW_NUMBER - interval
3 1 2
4 2 2
5 3 2
6 4 2
10 5 5
11 6 5
12 7 5
20 8 12
21 9 12

Within each range, this difference is a constant. When a gap occurs, this difference changes.

Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.

This should be more efficient since it does not use the subqueries.

Let's create a sample table and check:

Read the rest of this entry »

Written by Quassnoi

August 4th, 2009 at 11:00 pm

Posted in SQL Server