EXPLAIN EXTENDED

How to create fast database queries

SQL Server: deleting adjacent values (improved)

Comments enabled. I *really* need your comment

Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:

A quick reminder of the problem, taken from Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.

In the previous article, I used a subquery to do this:

WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.day = qo.day
                                AND qi.interval < = qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    [20090803_adjacent].records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.day = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    [20090803_adjacent].records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0

This subquery traverses adjacent intervals until it find the first interval that has no predecessor.

Its id then serves as a range identifier in a PARTITION BY clause.

However, this query can be improved.

Assume we have the following intervals:

interval
3
4
5
6
10
11
12
20
21

We have three ranges here: 3:6, 10:12 and 20:21.

To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:

interval ROW_NUMBER
3 1
4 2
5 3
6 4
10 5
11 6
12 7
20 8
21 9

ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.

This means that we can just subtract ROW_NUMBER from the interval:

interval ROW_NUMBER ROW_NUMBER - interval
3 1 2
4 2 2
5 3 2
6 4 2
10 5 5
11 6 5
12 7 5
20 8 12
21 9 12

Within each range, this difference is a constant. When a gap occurs, this difference changes.

Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.

This should be more efficient since it does not use the subqueries.

Let's create a sample table and check:

SET NOCOUNT ON
GO
DROP TABLE [20090803_adjacent].records
DROP SCHEMA [20090803_adjacent]
GO
CREATE SCHEMA [20090803_adjacent]
CREATE TABLE records
        (
        id INT NOT NULL,
        unitId INT NOT NULL,
        day INT NOT NULL,
        interval INT NOT NULL,
        amount FLOAT NOT NULL
        )
GO
ALTER TABLE [20090803_adjacent].records ADD CONSTRAINT PK_records_unitid_day_interval PRIMARY KEY (unitId, day, interval)
CREATE UNIQUE INDEX UX_records_id ON [20090803_adjacent].records (id)
GO
SELECT  RAND(20090803)
BEGIN TRANSACTION
DECLARE @pk INT
DECLARE @cnt INT
SET @pk = 0
WHILE @pk < 100000
BEGIN
        SET @cnt = 0
        WHILE @cnt < 5
        BEGIN
                INSERT
                INTO    &#91;20090803_adjacent&#93;.records
                SELECT  @pk * 5 + @cnt, @pk / 10000 + 1, @pk % 10000 + 1, @cnt + 1, RAND() * 10
                WHERE   RAND() < 0.9
                SET @cnt = @cnt + 1
        END
        SET @pk = @pk + 1
END
COMMIT
&#91;/sourcecode&#93;

About <strong>450,000</strong> rows in the table above, with <strong>100,000</strong> partitions (i. e. distinct <code>(uninit, day)</code> pairs).

Deleting with a subquery:


WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.[day] = qo.[day]
                                AND qi.interval <= qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    &#91;20090803_adjacent&#93;.records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.&#91;day&#93; = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    &#91;20090803_adjacent&#93;.records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0
&#91;/sourcecode&#93;

<pre>
Table 'Worktable'. Scan count 1182425, logical reads 5190358, physical reads 105, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'records'. Scan count 449929, logical reads 2442340, physical reads 427, read-ahead reads 34, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 15422 ms,  elapsed time = 12983 ms.
</pre>

<strong>13 seconds.</strong>

Now, let's try with the improved query:


WITH    source AS
        (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY unitid, day ORDER BY interval) rn
        FROM    [20090804_adjacent].records
        ),
        rows AS
        (
        SELECT  *, ROW_NUMBER() OVER (PARTITION BY unitid, day, interval - rn ORDER BY interval) AS rnm
        FROM    source
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0
Table 'records'. Scan count 1, logical reads 1077971, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 3547 ms,  elapsed time = 4959 ms.

Not only this query is much more concise and elegant, but it's also more efficient: less than 5 seconds.

Written by Quassnoi

August 4th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply