EXPLAIN EXTENDED

How to create fast database queries

Ordering columns independently

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table with 5 columns in it.

What's the easiest way to select all rows, but where each column is individually randomized?

All I can think of is to select each column separately, along with

ROW_NUMBER() OVER (ORDER BY NEWID()) AS lookup

, and then join each column back together on lookup.

Is there an easier way?

Though it's possible to do this by means of JOIN'ing, there is an easier way.

Since all SQL operations are row-level, we will need to make 5 independent ordered sets, which we later can use to order each column set. This implies having 5 times as much rows as there are rows in the original table.

But instead of JOIN'ing, we can use SQL Server's clause which maps rows to columns, which is called PIVOT.

We need to do the following:

  1. Generate 5 sets of column values and UNION ALL them.
  2. For each set, select a ROW_NUMBER() OVER (ORDER BY NEWID()) along with each row. It will be unique for each row in any set and will give a row identifier for PIVOT'ing.
  3. For each set, select a set identifier: just a string with column name. This will give a column identifier for PIVOT'ing
  4. Finally, PIVOT the column values: they will get into appropriate row and column, based on ROW_NUMBER() and the set name

Since PIVOT requires aggregation (in case row and column identifiers are not unique), we need some aggregation function, but our query is designed so that there is exactly one row with a given ROW_NUMBER() and set name, so any aggregation function will do.

We will use MIN.

Now, let's create a sample table and see how it works:

CREATE SCHEMA [20090522_random]
CREATE TABLE t_random (
        id INT NOT NULL PRIMARY KEY,
        col1 VARCHAR(50) NOT NULL,
        col2 VARCHAR(50) NOT NULL,
        col3 VARCHAR(50) NOT NULL,
        col4 VARCHAR(50) NOT NULL,
        col5 VARCHAR(50) NOT NULL
)
GO
CREATE INDEX IX_random_col1 ON [20090522_random].t_random(col1)
CREATE INDEX IX_random_col2 ON [20090522_random].t_random(col2)
CREATE INDEX IX_random_col3 ON [20090522_random].t_random(col3)
CREATE INDEX IX_random_col4 ON [20090522_random].t_random(col4)
CREATE INDEX IX_random_col5 ON [20090522_random].t_random(col5)
DECLARE @cnt INT;
BEGIN TRANSACTION
SET @cnt = 1;
WHILE @cnt <= 1000
BEGIN
        INSERT
        INTO    &#91;20090522_random&#93;.t_random
        VALUES  (
                @cnt,
                'Value 1 ' + CAST(@cnt AS VARCHAR),
                'Value 2 ' + CAST(@cnt AS VARCHAR),
                'Value 3 ' + CAST(@cnt AS VARCHAR),
                'Value 4 ' + CAST(@cnt AS VARCHAR),
                'Value 5 ' + CAST(@cnt AS VARCHAR)
                )
        SET @cnt = @cnt + 1
END
COMMIT
GO
&#91;/sourcecode&#93;

This table has <strong>1,000</strong> rows with <strong>5</strong> <code>VARCHAR</code> columns.

There is an index on each of the columns, so that each column set can be selected by browsing the appropriate index, without lookup up the table itself.

The query will look like this:


SELECT  *
FROM    (
        SELECT  'col1' AS name, col1 AS col, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
        FROM    "20090522_random".t_random
        UNION ALL
        SELECT  'col2', col2, ROW_NUMBER() OVER (ORDER BY NEWID())
        FROM    "20090522_random".t_random
        UNION ALL
        SELECT  'col3', col3, ROW_NUMBER() OVER (ORDER BY NEWID())
        FROM    "20090522_random".t_random
        UNION ALL
        SELECT  'col4', col4, ROW_NUMBER() OVER (ORDER BY NEWID())
        FROM    "20090522_random".t_random
        UNION ALL
        SELECT  'col5', col5, ROW_NUMBER() OVER (ORDER BY NEWID())
        FROM    "20090522_random".t_random
        ) q
PIVOT   (
        MIN(col)
        FOR name IN ([col1], [col2], [col3], [col4], [col5])
        ) pt

rn col1 col2 col3 col4 col5
1 Value 1 563 Value 2 676 Value 3 605 Value 4 951 Value 5 49
2 Value 1 278 Value 2 657 Value 3 281 Value 4 307 Value 5 496
3 Value 1 653 Value 2 59 Value 3 420 Value 4 354 Value 5 197
4 Value 1 635 Value 2 682 Value 3 620 Value 4 139 Value 5 31
5 Value 1 789 Value 2 217 Value 3 273 Value 4 150 Value 5 388
6 Value 1 110 Value 2 374 Value 3 740 Value 4 834 Value 5 444
7 Value 1 992 Value 2 758 Value 3 310 Value 4 151 Value 5 495
8 Value 1 741 Value 2 17 Value 3 96 Value 4 822 Value 5 73
9 Value 1 12 Value 2 599 Value 3 793 Value 4 578 Value 5 826
10 Value 1 446 Value 2 588 Value 3 941 Value 4 741 Value 5 603
997 Value 1 704 Value 2 950 Value 3 610 Value 4 600 Value 5 399
998 Value 1 898 Value 2 980 Value 3 590 Value 4 698 Value 5 747
999 Value 1 864 Value 2 241 Value 3 983 Value 4 592 Value 5 882
1000 Value 1 180 Value 2 993 Value 3 981 Value 4 470 Value 5 920
1000 rows fetched in 0.0688s (0.0483s)
Table 't_random'. Scan count 5, logical reads 25, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 49 ms. 

The values are ordered independently.

This method, though, has one drawback: it may only be used on columns of same datatype.

If we have more than one datatype, we will need to group columns by datatype, PIVOT them in separate queries and then JOIN the resultsets on rn.

Alternatively, we can cast the values into BINARY, and cast them back to appropriate types after PIVOT'ing.

Written by Quassnoi

May 22nd, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply