SQL Server: deleting adjacent values (improved)
Comments enabled. I *really* need your comment
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS
(
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY
(
SELECT TOP 1 qi.id AS mint
FROM [20090803_adjacent].records qi
WHERE qi.unitid = qo.unitid
AND qi.day = qo.day
AND qi.interval < = qo.interval
AND NOT EXISTS
(
SELECT NULL
FROM [20090803_adjacent].records t
WHERE t.unitid = qi.unitid
AND t.day = qi.day
AND t.interval = qi.interval - 1
)
ORDER BY
qi.interval DESC
)
ORDER BY interval
) AS rnm
FROM [20090803_adjacent].records qo
)
DELETE
FROM rows
WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
Its id then serves as a range identifier in a PARTITION BY clause.
However, this query can be improved.
Assume we have the following intervals:
| interval |
|---|
| 3 |
| 4 |
| 5 |
| 6 |
| 10 |
| 11 |
| 12 |
| 20 |
| 21 |
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:
| interval | ROW_NUMBER |
|---|---|
| 3 | 1 |
| 4 | 2 |
| 5 | 3 |
| 6 | 4 |
| 10 | 5 |
| 11 | 6 |
| 12 | 7 |
| 20 | 8 |
| 21 | 9 |
ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.
This means that we can just subtract ROW_NUMBER from the interval:
| interval | ROW_NUMBER | ROW_NUMBER - interval |
|---|---|---|
| 3 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 10 | 5 | 5 |
| 11 | 6 | 5 |
| 12 | 7 | 5 |
| 20 | 8 | 12 |
| 21 | 9 | 12 |
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check:
SET NOCOUNT ON
GO
DROP TABLE [20090803_adjacent].records
DROP SCHEMA [20090803_adjacent]
GO
CREATE SCHEMA [20090803_adjacent]
CREATE TABLE records
(
id INT NOT NULL,
unitId INT NOT NULL,
day INT NOT NULL,
interval INT NOT NULL,
amount FLOAT NOT NULL
)
GO
ALTER TABLE [20090803_adjacent].records ADD CONSTRAINT PK_records_unitid_day_interval PRIMARY KEY (unitId, day, interval)
CREATE UNIQUE INDEX UX_records_id ON [20090803_adjacent].records (id)
GO
SELECT RAND(20090803)
BEGIN TRANSACTION
DECLARE @pk INT
DECLARE @cnt INT
SET @pk = 0
WHILE @pk < 100000
BEGIN
SET @cnt = 0
WHILE @cnt < 5
BEGIN
INSERT
INTO [20090803_adjacent].records
SELECT @pk * 5 + @cnt, @pk / 10000 + 1, @pk % 10000 + 1, @cnt + 1, RAND() * 10
WHERE RAND() < 0.9
SET @cnt = @cnt + 1
END
SET @pk = @pk + 1
END
COMMIT
[/sourcecode]
About <strong>450,000</strong> rows in the table above, with <strong>100,000</strong> partitions (i. e. distinct <code>(uninit, day)</code> pairs).
Deleting with a subquery:
WITH rows AS
(
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY
(
SELECT TOP 1 qi.id AS mint
FROM [20090803_adjacent].records qi
WHERE qi.unitid = qo.unitid
AND qi.[day] = qo.[day]
AND qi.interval <= qo.interval
AND NOT EXISTS
(
SELECT NULL
FROM [20090803_adjacent].records t
WHERE t.unitid = qi.unitid
AND t.[day] = qi.day
AND t.interval = qi.interval - 1
)
ORDER BY
qi.interval DESC
)
ORDER BY interval
) AS rnm
FROM [20090803_adjacent].records qo
)
DELETE
FROM rows
WHERE rnm % 2 = 0
[/sourcecode]
<pre>
Table 'Worktable'. Scan count 1182425, logical reads 5190358, physical reads 105, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'records'. Scan count 449929, logical reads 2442340, physical reads 427, read-ahead reads 34, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 15422 ms, elapsed time = 12983 ms.
</pre>
<strong>13 seconds.</strong>
Now, let's try with the improved query:
WITH source AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY unitid, day ORDER BY interval) rn
FROM [20090804_adjacent].records
),
rows AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY unitid, day, interval - rn ORDER BY interval) AS rnm
FROM source
)
DELETE
FROM rows
WHERE rnm % 2 = 0
Table 'records'. Scan count 1, logical reads 1077971, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3547 ms, elapsed time = 4959 ms.
Not only this query is much more concise and elegant, but it's also more efficient: less than 5 seconds.
Subscribe in a reader