SQL Server: deleting adjacent values (improved)
Comments enabled. I *really* need your comment
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)
are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of interval
s and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ( SELECT TOP 1 qi.id AS mint FROM [20090803_adjacent].records qi WHERE qi.unitid = qo.unitid AND qi.day = qo.day AND qi.interval < = qo.interval AND NOT EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.day = qi.day AND t.interval = qi.interval - 1 ) ORDER BY qi.interval DESC ) ORDER BY interval ) AS rnm FROM [20090803_adjacent].records qo ) DELETE FROM rows WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
Its id
then serves as a range identifier in a PARTITION BY
clause.
However, this query can be improved.
Assume we have the following intervals:
interval |
---|
3 |
4 |
5 |
6 |
10 |
11 |
12 |
20 |
21 |
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER
of these intervals:
interval | ROW_NUMBER |
---|---|
3 | 1 |
4 | 2 |
5 | 3 |
6 | 4 |
10 | 5 |
11 | 6 |
12 | 7 |
20 | 8 |
21 | 9 |
ROW_NUMBER
s, by definition, don't have gaps within the partition, while the interval
s can have gaps. However, inside each range ROW_NUMBER
s increase with interval
s.
This means that we can just subtract ROW_NUMBER
from the interval
:
interval | ROW_NUMBER | ROW_NUMBER - interval |
---|---|---|
3 | 1 | 2 |
4 | 2 | 2 |
5 | 3 | 2 |
6 | 4 | 2 |
10 | 5 | 5 |
11 | 6 | 5 |
12 | 7 | 5 |
20 | 8 | 12 |
21 | 9 | 12 |
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check:
SET NOCOUNT ON GO DROP TABLE [20090803_adjacent].records DROP SCHEMA [20090803_adjacent] GO CREATE SCHEMA [20090803_adjacent] CREATE TABLE records ( id INT NOT NULL, unitId INT NOT NULL, day INT NOT NULL, interval INT NOT NULL, amount FLOAT NOT NULL ) GO ALTER TABLE [20090803_adjacent].records ADD CONSTRAINT PK_records_unitid_day_interval PRIMARY KEY (unitId, day, interval) CREATE UNIQUE INDEX UX_records_id ON [20090803_adjacent].records (id) GO SELECT RAND(20090803) BEGIN TRANSACTION DECLARE @pk INT DECLARE @cnt INT SET @pk = 0 WHILE @pk < 100000 BEGIN SET @cnt = 0 WHILE @cnt < 5 BEGIN INSERT INTO [20090803_adjacent].records SELECT @pk * 5 + @cnt, @pk / 10000 + 1, @pk % 10000 + 1, @cnt + 1, RAND() * 10 WHERE RAND() < 0.9 SET @cnt = @cnt + 1 END SET @pk = @pk + 1 END COMMIT [/sourcecode] About <strong>450,000</strong> rows in the table above, with <strong>100,000</strong> partitions (i. e. distinct <code>(uninit, day)</code> pairs). Deleting with a subquery: WITH rows AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ( SELECT TOP 1 qi.id AS mint FROM [20090803_adjacent].records qi WHERE qi.unitid = qo.unitid AND qi.[day] = qo.[day] AND qi.interval <= qo.interval AND NOT EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.[day] = qi.day AND t.interval = qi.interval - 1 ) ORDER BY qi.interval DESC ) ORDER BY interval ) AS rnm FROM [20090803_adjacent].records qo ) DELETE FROM rows WHERE rnm % 2 = 0 [/sourcecode] <pre> Table 'Worktable'. Scan count 1182425, logical reads 5190358, physical reads 105, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'records'. Scan count 449929, logical reads 2442340, physical reads 427, read-ahead reads 34, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 15422 ms, elapsed time = 12983 ms. </pre> <strong>13 seconds.</strong> Now, let's try with the improved query: WITH source AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY unitid, day ORDER BY interval) rn FROM [20090804_adjacent].records ), rows AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY unitid, day, interval - rn ORDER BY interval) AS rnm FROM source ) DELETE FROM rows WHERE rnm % 2 = 0
Table 'records'. Scan count 1, logical reads 1077971, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3547 ms, elapsed time = 4959 ms.
Not only this query is much more concise and elegant, but it's also more efficient: less than 5 seconds.