SQL Server: deleting adjacent values
Comments enabled. I *really* need your comment
From Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)
are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:
SELECT * FROM mytable qi WHERE EXISTS ( SELECT NULL FROM mytable t WHERE t.unitid = qi.unitid AND t.day = qi.day AND t.interval = qi.interval - 1 )
However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.
If we just delete everything, the query above will not return any records just as well, but it's not our goal.
The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.
So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).
In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.
However, we can formulate the delete condition.
We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.
The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).
The second record of any range should be deleted, since it's adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).
The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.
In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.
The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.
It is best done by iterating the records up from the current and returning the id
of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1)
. This can be done in a subquery right in the PARTITION BY
clause of the ROW_NUMBER
function.
Let's create a sample table and compose the query:
SET NOCOUNT ON GO DROP TABLE [20090803_adjacent].records DROP SCHEMA [20090803_adjacent] GO CREATE SCHEMA [20090803_adjacent] CREATE TABLE records ( id INT NOT NULL, unitId INT NOT NULL, day INT NOT NULL, interval INT NOT NULL, amount FLOAT NOT NULL ) GO ALTER TABLE [20090803_adjacent].records ADD CONSTRAINT PK_records_unitid_day_interval PRIMARY KEY (unitId, day, interval) CREATE UNIQUE INDEX UX_records_id ON [20090803_adjacent].records (id) GO SELECT RAND(20090803) BEGIN TRANSACTION DECLARE @pk INT DECLARE @cnt INT SET @pk = 0 WHILE @pk < 100000 BEGIN SET @cnt = 0 WHILE @cnt < 5 BEGIN INSERT INTO [20090803_adjacent].records SELECT @pk * 5 + @cnt, @pk / 10000 + 1, @pk % 10000 + 1, @cnt + 1, RAND() * 10 WHERE RAND() < 0.9 SET @cnt = @cnt + 1 END SET @pk = @pk + 1 END COMMIT [/sourcecode] The table contains abount <strong>450,000</strong> records with exactly <strong>100,000</strong> distinct values of <code>(unitId, day)</code> and suссessive <strong>interval</strong>s from <strong>1</strong> to <strong>5</strong> with random gaps. Here's the query: WITH rows AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ( SELECT TOP 1 qi.id AS mint FROM [20090803_adjacent].records qi WHERE qi.unitid = qo.unitid AND qi.[day] = qo.[day] AND qi.interval <= qo.interval AND NOT EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.[day] = qi.day AND t.interval = qi.interval - 1 ) ORDER BY qi.interval DESC ) ORDER BY interval ) AS rnm FROM [20090803_adjacent].records qo ) DELETE FROM rows WHERE rnm % 2 = 0 [/sourcecode] <pre> SQL Server parse and compile time: CPU time = 16 ms, elapsed time = 37 ms. Table 'Worktable'. Scan count 1182425, logical reads 5190358, physical reads 1598, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'records'. Scan count 449929, logical reads 2435811, physical reads 1386, read-ahead reads 3283, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 15860 ms, elapsed time = 15832 ms. </pre> <strong>15.8 seconds</strong>, quite a decent performance for a <strong>DML</strong> query that affects <strong>177,350</strong> rows. Now let's check: SELECT * FROM [20090803_adjacent].records qi WHERE EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.[day] = qi.day AND t.interval = qi.interval - 1 )
id | unitId | day | interval | amount | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows fetched in 0.0002s (0.3912s) |
Nothing were returned, as it should be.
Update:
See this article for a more efficient way to cope with this problem: