EXPLAIN EXTENDED

How to create fast database queries

Selecting records unique on each column

Comments enabled. I *really* need your comment

From Stack Overflow:

I need to select 3 columns from a table, but I need each value from any column to be unique in the resultset for this column.

This query:

SELECT DISTINCT TOP 10 a, b, c
FROM   x

will return 10 distinct sets.

How do I do it?

Here we will need to make sure that for each column in a record we are going to return, no previous record returned so far does not contain the same value of the column.

Let's create a sample table and construct the query:

SET NOCOUNT ON
GO
DROP TABLE [20090603_unique].t_unique
DROP SCHEMA [20090603_unique]
GO
CREATE SCHEMA [20090603_unique]
CREATE TABLE t_unique (
id INT NOT NULL PRIMARY KEY,
a INT NOT NULL,
b INT NOT NULL,
c INT NOT NULL
)
GO
DECLARE @cnt INT
DECLARE @a INT
DECLARE @b INT
DECLARE @c INT
SELECT RAND(20090603)
BEGIN TRANSACTION
SET @cnt = 1;
WHILE @cnt <= 10000 BEGIN SET @a = FLOOR(RAND() * 100) + 1 SET @b = FLOOR(RAND() * 100) + 1 SET @c = FLOOR(RAND() * 100) + 1 INSERT INTO [20090603_unique].t_unique VALUES (@cnt, @a, @b, @c) SET @cnt = @cnt + 1 END COMMIT GO [/sourcecode] This table has 10,000 records.

Values of the columns a, b, and с are random, from 1 to 100.

Now, how do we do it?

We need to make sure no previously returned row contains the same values of any of the columns.

Since we cannot efficiently look back into the resultset already returned, we will use more strict condition.

We will arrange all rows in a certain order, and will filter out those rows that could be possibly returned earlier (even if they were not).

For this to work, we need to order the rows so that probability of a row with same value of any of the columns to be encountered earlier is minimal.

This is best achieved with ordering by the sum of the columns modulo 2. This will give quite a fair distribution.

Here's the query:

WITH q AS
(
SELECT qi.*, ROW_NUMBER() OVER (ORDER BY (a + b + c) % 2) AS rn
FROM (
SELECT DISTINCT a, b, c
FROM [20090603_unique].t_unique u
) qi
)
SELECT TOP 10 a, b, c
FROM q
WHERE NOT EXISTS
(
SELECT 1
FROM q qa
WHERE qa.rn < q.rn AND qa.a = q.a ) AND NOT EXISTS ( SELECT 1 FROM q qb WHERE qb.rn < q.rn AND qb.b = q.b ) AND NOT EXISTS ( SELECT 1 FROM q qc WHERE qc.rn < q.rn AND qc.c = q.c ) [/sourcecode]

a b c
4 10 10
6 70 4
86 17 81
85 19 48
76 82 62
5 55 42
54 41 33
94 92 20
55 14 47
48 88 56
10 rows fetched in 0.0004s (0.1029s)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_unique'. Scan count 4, logical reads 132, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 94 ms,  elapsed time = 103 ms. 

What do we do here?

  1. First, we eliminate distinct sets using DISTINCT: since they always come in same order, they would spoil the whole query
  2. We assign a ROW_NUMBER to each set, to know if it had already been eligible for returning
  3. Finally, we take every row from the resultset and check it against all rows that had ever been checked before (that is, rows with lower rn)

Subqueries used with NOT EXISTS use HASH ANTI JOIN and are quite efficient.

Written by Quassnoi

June 3rd, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply