Archive for August 4th, 2009
SQL Server: deleting adjacent values (improved)
Comments enabled. I *really* need your comment
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS
(
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY
(
SELECT TOP 1 qi.id AS mint
FROM [20090803_adjacent].records qi
WHERE qi.unitid = qo.unitid
AND qi.day = qo.day
AND qi.interval < = qo.interval
AND NOT EXISTS
(
SELECT NULL
FROM [20090803_adjacent].records t
WHERE t.unitid = qi.unitid
AND t.day = qi.day
AND t.interval = qi.interval - 1
)
ORDER BY
qi.interval DESC
)
ORDER BY interval
) AS rnm
FROM [20090803_adjacent].records qo
)
DELETE
FROM rows
WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
Its id then serves as a range identifier in a PARTITION BY clause.
However, this query can be improved.
Assume we have the following intervals:
| interval |
|---|
| 3 |
| 4 |
| 5 |
| 6 |
| 10 |
| 11 |
| 12 |
| 20 |
| 21 |
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:
| interval | ROW_NUMBER |
|---|---|
| 3 | 1 |
| 4 | 2 |
| 5 | 3 |
| 6 | 4 |
| 10 | 5 |
| 11 | 6 |
| 12 | 7 |
| 20 | 8 |
| 21 | 9 |
ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.
This means that we can just subtract ROW_NUMBER from the interval:
| interval | ROW_NUMBER | ROW_NUMBER - interval |
|---|---|---|
| 3 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 10 | 5 | 5 |
| 11 | 6 | 5 |
| 12 | 7 | 5 |
| 20 | 8 | 12 |
| 21 | 9 | 12 |
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check:
Subscribe in a reader