EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Selecting records unique on each column

From Stack Overflow:

I need to select 3 columns from a table, but I need each value from any column to be unique in the resultset for this column.

This query:

SELECT DISTINCT TOP 10 a, b, c
FROM   x

will return 10 distinct sets.

How do I do it?

Here we will need to make sure that for each column in a record we are going to return, no previous record returned so far does not contain the same value of the column.

Let’s create a sample table and construct the query:

SET NOCOUNT ON
GO
DROP TABLE [20090603_unique].t_unique
DROP SCHEMA [20090603_unique]
GO
CREATE SCHEMA [20090603_unique]
CREATE TABLE t_unique (
        id INT NOT NULL PRIMARY KEY,
        a INT NOT NULL,
        b INT NOT NULL,
        c INT NOT NULL
)
GO
DECLARE @cnt INT
DECLARE @a INT
DECLARE @b INT
DECLARE @c INT
SELECT  RAND(20090603)
BEGIN TRANSACTION
SET @cnt = 1;
WHILE @cnt <= 10000
BEGIN
        SET @a = FLOOR(RAND() * 100) + 1
        SET @b = FLOOR(RAND() * 100) + 1
        SET @c = FLOOR(RAND() * 100) + 1
        INSERT
        INTO    [20090603_unique].t_unique
        VALUES  (@cnt, @a, @b, @c)
        SET @cnt = @cnt + 1
END
COMMIT
GO

This table has 10,000 records.

Values of the columns a, b, and с are random, from 1 to 100.

Now, how do we do it?

We need to make sure no previously returned row contains the same values of any of the columns.

Since we cannot efficiently look back into the resultset already returned, we will use more strict condition.

We will arrange all rows in a certain order, and will filter out those rows that could be possibly returned earlier (even if they were not).

For this to work, we need to order the rows so that probability of a row with same value of any of the columns to be encountered earlier is minimal.

This is best achieved with ordering by the sum of the columns modulo 2. This will give quite a fair distribution.

Here’s the query:

WITH    q AS
        (
        SELECT  qi.*, ROW_NUMBER() OVER (ORDER BY (a + b + c) % 2) AS rn
        FROM    (
                SELECT  DISTINCT a, b, c
                FROM    [20090603_unique].t_unique u
                ) qi
        )
SELECT  TOP 10 a, b, c
FROM    q
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    q qa
        WHERE   qa.rn < q.rn
                AND qa.a = q.a
        )
        AND NOT EXISTS
        (
        SELECT  1
        FROM    q qb
        WHERE   qb.rn < q.rn
                AND qb.b = q.b
        )
        AND NOT EXISTS
        (
        SELECT  1
        FROM    q qc
        WHERE   qc.rn < q.rn
                AND qc.c = q.c
        )
a b c
4 10 10
6 70 4
86 17 81
85 19 48
76 82 62
5 55 42
54 41 33
94 92 20
55 14 47
48 88 56
10 rows fetched in 0.0004s (0.1029s)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_unique'. Scan count 4, logical reads 132, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 94 ms,  elapsed time = 103 ms.

What do we do here?

  1. First, we eliminate distinct sets using DISTINCT: since they always come in same order, they would spoil the whole query
  2. We assign a ROW_NUMBER to each set, to know if it had already been eligible for returning
  3. Finally, we take every row from the resultset and check it against all rows that had ever been checked before (that is, rows with lower rn)

Subqueries used with NOT EXISTS use HASH ANTI JOIN and are quite efficient.

Written by Quassnoi

June 3rd, 2009 at 11:00 pm

Posted in SQL Server

Comments are closed.