EXPLAIN EXTENDED

How to create fast database queries

Archive for August 3rd, 2009

SQL Server: deleting adjacent values

Comments enabled. I *really* need your comment

From Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:

SELECT  *
FROM    mytable qi
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable t
        WHERE   t.unitid = qi.unitid
                AND t.day = qi.day
                AND t.interval = qi.interval - 1
        )

However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.

If we just delete everything, the query above will not return any records just as well, but it's not our goal.

The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.

So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).

In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.

However, we can formulate the delete condition.

We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.

The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).

The second record of any range should be deleted, since it's adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).

The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.

In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.

The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.

It is best done by iterating the records up from the current and returning the id of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1). This can be done in a subquery right in the PARTITION BY clause of the ROW_NUMBER function.

Let's create a sample table and compose the query:

Read the rest of this entry »

Written by Quassnoi

August 3rd, 2009 at 11:00 pm

Posted in SQL Server