Archive for August 4th, 2009
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428
(UnitID, Day, Interval)are unique.
In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of
intervals and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ( SELECT TOP 1 qi.id AS mint FROM [20090803_adjacent].records qi WHERE qi.unitid = qo.unitid AND qi.day = qo.day AND qi.interval < = qo.interval AND NOT EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.day = qi.day AND t.interval = qi.interval - 1 ) ORDER BY qi.interval DESC ) ORDER BY interval ) AS rnm FROM [20090803_adjacent].records qo ) DELETE FROM rows WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
id then serves as a range identifier in a
PARTITION BY clause.
However, this query can be improved.
Assume we have the following intervals:
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition
ROW_NUMBER of these intervals:
ROW_NUMBERs, by definition, don't have gaps within the partition, while the
intervals can have gaps. However, inside each range
ROW_NUMBERs increase with
This means that we can just subtract
ROW_NUMBER from the
|interval||ROW_NUMBER||ROW_NUMBER - interval|
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check: