EXPLAIN EXTENDED

How to create fast database queries

SQL Server: random records avoiding CTE reevaluation

Comments enabled. I *really* need your comment

From Stack Overflow:

I need to get 5 random records from a table plus a further record based on data from the users preferences.

How can I combine the two statements whilst ensuring that no results are repeated (i. e. the favourite is not present in the 5 random records)?

We should select the 6th record so that it would satisfy two conditions: first, it should be preferred by user, second, it should not be among the 5 random records selected earlier.

The problem with random records it that they are, um, random. We usually cannot reproduce 5 random records to apply further conditions on them.

That's where the CTEs would come handy, but there's a little problem.

A CTE, as Microsoft understands them, is just an alias for an online view.

Therefore, each time a CTE is used in a query, it may (or may not) be reevaluated.

It's not a problem for a plain CTEs which are built out of static columns.

However, this is totally a problem when we try to UNION two queries using an undeterministic CTE (like, ordered by NEWID()), as we would do in this case.

Let's create a sample table and see what happens:

CREATE SCHEMA [20090728_random]
CREATE TABLE records
        (
        id INT NOT NULL PRIMARY KEY,
        preferred INT NOT NULL,
        value VARCHAR(20) NOT NULL
        )
GO
BEGIN TRANSACTION
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 100000
BEGIN
        INSERT
        INTO    &#91;20090728_random&#93;.records
        VALUES  (
                @cnt,
                CASE @cnt % 1000 WHEN 0 THEN 1 ELSE 0 END,
                'Record ' + CAST(@cnt AS VARCHAR)
                )
        SET @cnt = @cnt + 1
END
COMMIT
&#91;/sourcecode&#93;

There are <strong>100,000</strong> rows in the table, with <strong>100</strong> of them marked as <code>preferred</code>.

Let's see what's happening when we try to select <code>TOP 5 * ORDER BY RANDOM()</code> in a <strong>CTE</strong> and then <code>UNION ALL</code> two <strong>CTE</strong>s:


WITH    rows AS
        (
        SELECT  TOP 5 r.*, 1 AS rn
        FROM    [20090728_random].records r
        ORDER BY
                NEWID()
        )
SELECT  *
FROM    rows
UNION ALL
SELECT  *
FROM    rows

id preferred value rn
13696 0 Record 13696 1
81375 0 Record 81375 1
97171 0 Record 97171 1
15569 0 Record 15569 1
65057 0 Record 65057 1
93492 0 Record 93492 1
63785 0 Record 63785 1
36144 0 Record 36144 1
52825 0 Record 52825 1
53930 0 Record 53930 1
10 rows fetched in 0.0005s (0.1386s)

We have two completely different sets. This is because the CTE was reevaluated in the second query.

This means we will not be able to use the CTE to check the 6th record. The CTE will be reevaluated, and the check will fail.

We need to build our query this way:

  1. Build a CTE called randoms to select 5 random records
  2. Build a CTE called prefs to select all preferred records
  3. FULL OUTER JOIN these CTEs on id
  4. Select TOP 6 results, ordering by randoms.id DESC
  5. JOIN the table using COALESCE(randoms.id, prefs.id)

This will select each record exactly once from each of the CTEs, even it belongs to both of them.

Ordering by randoms.id DESC makes sure that 5 random records will be selected first.

The 6th record will not be among the 5 random ones (or it would satisfy the FULL OUTER JOIN condition and would be selected earlier), and it's guaranteed to have preferred = 1 (there may be at most 5 records not satisfying this condition, and they already had been selected).

And here's the query:

WITH    randoms AS
        (
        SELECT  TOP 5 r.*
        FROM    [20090728_random].records r
        ORDER BY
                NEWID()
        ),
        prefs AS
        (
        SELECT  id
        FROM    [20090728_random].records
        WHERE   preferred = 1
        )
SELECT  TOP 6 r.*
FROM    randoms
FULL OUTER JOIN
        prefs
ON      randoms.id = prefs.id
JOIN    [20090728_random].records r
ON      r.id = COALESCE (randoms.id, prefs.id)
ORDER BY
        randoms.id DESC

id preferred value
83022 0 Record 83022
79921 0 Record 79921
78241 0 Record 78241
58192 0 Record 58192
18954 0 Record 18954
1000 1 Record 1000
6 rows fetched in 0.0004s (0.1766s)

First 5 records are random, the 6th one is preferred.

Written by Quassnoi

July 28th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply