EXPLAIN EXTENDED

How to create fast database queries

SQL Server: deleting adjacent values

Comments enabled. I *really* need your comment

From Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:

SELECT  *
FROM    mytable qi
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable t
        WHERE   t.unitid = qi.unitid
                AND t.day = qi.day
                AND t.interval = qi.interval - 1
        )

However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.

If we just delete everything, the query above will not return any records just as well, but it's not our goal.

The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.

So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).

In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.

However, we can formulate the delete condition.

We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.

The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).

The second record of any range should be deleted, since it's adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).

The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.

In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.

The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.

It is best done by iterating the records up from the current and returning the id of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1). This can be done in a subquery right in the PARTITION BY clause of the ROW_NUMBER function.

Let's create a sample table and compose the query:

SET NOCOUNT ON
GO
DROP TABLE [20090803_adjacent].records
DROP SCHEMA [20090803_adjacent]
GO
CREATE SCHEMA [20090803_adjacent]
CREATE TABLE records
        (
        id INT NOT NULL,
        unitId INT NOT NULL,
        day INT NOT NULL,
        interval INT NOT NULL,
        amount FLOAT NOT NULL
        )
GO
ALTER TABLE [20090803_adjacent].records ADD CONSTRAINT PK_records_unitid_day_interval PRIMARY KEY (unitId, day, interval)
CREATE UNIQUE INDEX UX_records_id ON [20090803_adjacent].records (id)
GO
SELECT  RAND(20090803)
BEGIN TRANSACTION
DECLARE @pk INT
DECLARE @cnt INT
SET @pk = 0
WHILE @pk < 100000
BEGIN
        SET @cnt = 0
        WHILE @cnt < 5
        BEGIN
                INSERT
                INTO    &#91;20090803_adjacent&#93;.records
                SELECT  @pk * 5 + @cnt, @pk / 10000 + 1, @pk % 10000 + 1, @cnt + 1, RAND() * 10
                WHERE   RAND() < 0.9
                SET @cnt = @cnt + 1
        END
        SET @pk = @pk + 1
END
COMMIT
&#91;/sourcecode&#93;

The table contains abount <strong>450,000</strong> records with exactly <strong>100,000</strong> distinct values of <code>(unitId, day)</code> and suссessive <strong>interval</strong>s from <strong>1</strong> to <strong>5</strong> with random gaps.

Here's the query:


WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.[day] = qo.[day]
                                AND qi.interval <= qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    &#91;20090803_adjacent&#93;.records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.&#91;day&#93; = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    &#91;20090803_adjacent&#93;.records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0
&#91;/sourcecode&#93;

<pre>
SQL Server parse and compile time: 
   CPU time = 16 ms, elapsed time = 37 ms.
Table 'Worktable'. Scan count 1182425, logical reads 5190358, physical reads 1598, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'records'. Scan count 449929, logical reads 2435811, physical reads 1386, read-ahead reads 3283, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 15860 ms,  elapsed time = 15832 ms.
</pre>

<strong>15.8 seconds</strong>, quite a decent performance for a <strong>DML</strong> query that affects <strong>177,350</strong> rows.

Now let's check:


SELECT  *
FROM    [20090803_adjacent].records qi
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    [20090803_adjacent].records t
        WHERE   t.unitid = qi.unitid
                AND t.[day] = qi.day
                AND t.interval = qi.interval - 1
        )
id unitId day interval amount
0 rows fetched in 0.0002s (0.3912s)

Nothing were returned, as it should be.

Update:

See this article for a more efficient way to cope with this problem:

Written by Quassnoi

August 3rd, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply