Archive for August 4th, 2009
SQL Server: deleting adjacent values (improved)
Comments enabled. I *really* need your comment
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)
are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of interval
s and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ( SELECT TOP 1 qi.id AS mint FROM [20090803_adjacent].records qi WHERE qi.unitid = qo.unitid AND qi.day = qo.day AND qi.interval < = qo.interval AND NOT EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.day = qi.day AND t.interval = qi.interval - 1 ) ORDER BY qi.interval DESC ) ORDER BY interval ) AS rnm FROM [20090803_adjacent].records qo ) DELETE FROM rows WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
Its id
then serves as a range identifier in a PARTITION BY
clause.
However, this query can be improved.
Assume we have the following intervals:
interval |
---|
3 |
4 |
5 |
6 |
10 |
11 |
12 |
20 |
21 |
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER
of these intervals:
interval | ROW_NUMBER |
---|---|
3 | 1 |
4 | 2 |
5 | 3 |
6 | 4 |
10 | 5 |
11 | 6 |
12 | 7 |
20 | 8 |
21 | 9 |
ROW_NUMBER
s, by definition, don't have gaps within the partition, while the interval
s can have gaps. However, inside each range ROW_NUMBER
s increase with interval
s.
This means that we can just subtract ROW_NUMBER
from the interval
:
interval | ROW_NUMBER | ROW_NUMBER - interval |
---|---|---|
3 | 1 | 2 |
4 | 2 | 2 |
5 | 3 | 2 |
6 | 4 | 2 |
10 | 5 | 5 |
11 | 6 | 5 |
12 | 7 | 5 |
20 | 8 | 12 |
21 | 9 | 12 |
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check: