SQL Server: deleting adjacent values (improved) at EXPLAIN EXTENDED

SQL Server: deleting adjacent values (improved)

Comments enabled. I *really* need your comment

Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:

SQL Server: deleting adjacent values

A quick reminder of the problem, taken from Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount

1 100 10 21 9.345

2 100 10 22 9.367

3 200 11 21 4.150

4 300 11 21 4.350

5 300 11 22 4.734

6 300 11 23 5.106

7 400 13 21 10.257

8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

id	unitId	day	interval	amount
1	100	10	21	9.345
2	100	10	22	9.367
3	200	11	21	4.150
4	300	11	21	4.350
5	300	11	22	4.734
6	300	11	23	5.106
7	400	13	21	10.257
8	400	13	22	10.428

To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.

In the previous article, I used a subquery to do this:

WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.day = qo.day
                                AND qi.interval < = qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    [20090803_adjacent].records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.day = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    [20090803_adjacent].records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0

This subquery traverses adjacent intervals until it find the first interval that has no predecessor.

Its id then serves as a range identifier in a PARTITION BY clause.

However, this query can be improved.

Assume we have the following intervals:

interval
3
4
5
6
10
11
12
20
21

We have three ranges here: 3:6, 10:12 and 20:21.

To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:

interval	ROW_NUMBER
3	1
4	2
5	3
6	4
10	5
11	6
12	7
20	8
21	9

ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.

This means that we can just subtract ROW_NUMBER from the interval:

interval	ROW_NUMBER	ROW_NUMBER - interval
3	1	2
4	2	2
5	3	2
6	4	2
10	5	5
11	6	5
12	7	5
20	8	12
21	9	12

Within each range, this difference is a constant. When a gap occurs, this difference changes.

Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.

This should be more efficient since it does not use the subqueries.

Let's create a sample table and check:

SET NOCOUNT ON
GO
DROP TABLE [20090803_adjacent].records
DROP SCHEMA [20090803_adjacent]
GO
CREATE SCHEMA [20090803_adjacent]
CREATE TABLE records
        (
        id INT NOT NULL,
        unitId INT NOT NULL,
        day INT NOT NULL,
        interval INT NOT NULL,
        amount FLOAT NOT NULL
        )
GO
ALTER TABLE [20090803_adjacent].records ADD CONSTRAINT PK_records_unitid_day_interval PRIMARY KEY (unitId, day, interval)
CREATE UNIQUE INDEX UX_records_id ON [20090803_adjacent].records (id)
GO
SELECT  RAND(20090803)
BEGIN TRANSACTION
DECLARE @pk INT
DECLARE @cnt INT
SET @pk = 0
WHILE @pk < 100000
BEGIN
        SET @cnt = 0
        WHILE @cnt < 5
        BEGIN
                INSERT
                INTO    &#91;20090803_adjacent&#93;.records
                SELECT  @pk * 5 + @cnt, @pk / 10000 + 1, @pk % 10000 + 1, @cnt + 1, RAND() * 10
                WHERE   RAND() < 0.9
                SET @cnt = @cnt + 1
        END
        SET @pk = @pk + 1
END
COMMIT
&#91;/sourcecode&#93;

About <strong>450,000</strong> rows in the table above, with <strong>100,000</strong> partitions (i. e. distinct <code>(uninit, day)</code> pairs).

Deleting with a subquery:


WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.[day] = qo.[day]
                                AND qi.interval <= qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    &#91;20090803_adjacent&#93;.records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.&#91;day&#93; = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    &#91;20090803_adjacent&#93;.records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0
&#91;/sourcecode&#93;

<pre>
Table 'Worktable'. Scan count 1182425, logical reads 5190358, physical reads 105, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'records'. Scan count 449929, logical reads 2442340, physical reads 427, read-ahead reads 34, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 15422 ms,  elapsed time = 12983 ms.
</pre>

<strong>13 seconds.</strong>

Now, let's try with the improved query:


WITH    source AS
        (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY unitid, day ORDER BY interval) rn
        FROM    [20090804_adjacent].records
        ),
        rows AS
        (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY unitid, day, interval - rn ORDER BY interval) AS rnm
        FROM    source
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0

Table 'records'. Scan count 1, logical reads 1077971, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 3547 ms,  elapsed time = 4959 ms.

Not only this query is much more concise and elegant, but it's also more efficient: less than 5 seconds.

Written by Quassnoi

August 4th, 2009 at 11:00 pm

Posted in SQL Server

EXPLAIN EXTENDED