EXPLAIN EXTENDED

How to create fast database queries

Ranging random values

Comments enabled. I *really* need your comment

From Stack Overflow:

Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?

The ranges should not overlap and ideally have a similar amount of values in them.

This is a good job for SQL Server's analytic function NTILE.

This function, given a parameter N, assigns a rank from 1 to N to each value, just as in quadrile (4), quintile (5), percentile (100) etc.

Lower 1/N of values get a rank of 1, from 1/N to 2/N get a rank of 2, etc.

The value, though, are integer and random, and we can get gaps between the adjacent ranges.

To avoid this, we need to extend the ranges a little, so each range gets a common bound with both upper and lower neighbor. This is best done by using the middle of the gap as a common range boundary.

Say, NTILE assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.

Then we will use 205 as a common bound for the first two ranges, and 310 as a commond bound for the next two.

To do this, for each value selected, we will get also the previous one and the next one.

It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.

Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.

Let's create a sample table and see how it works:

SET NOCOUNT ON
GO
DROP TABLE [20090517_ranges].t_random
DROP SCHEMA [20090517_ranges]
GO
CREATE SCHEMA [20090517_ranges]
CREATE TABLE t_random (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL
)
GO
CREATE INDEX IX_random_value ON [20090517_ranges].t_random (value DESC)
GO
DECLARE @cnt INT
DECLARE @rand INT
SELECT  RAND(20090517)
BEGIN TRANSACTION
SET @cnt = 1
WHILE @cnt <= 1000
BEGIN
        SET @rand = FLOOR(RAND() * 100000) * 10
        INSERT
        INTO    &#91;20090517_ranges&#93;.t_random (id, value)
        VALUES  (@cnt, @rand)
        SET @cnt = @cnt + 1
END

COMMIT
GO
&#91;/sourcecode&#93;

The table has <strong>1,000</strong> records with values randomly distributed from <strong>0</strong> to <strong>999,990</strong>, all of them being multiples of <strong>10</strong>.

Now, let's try to split these values into <strong>4</strong> ranges:


SELECT  range,
        (MIN(getprev) + MIN(value)) / 2 as range_start,
        (MAX(getnext) + MAX(value)) / 2 AS range_end,
        COUNT(*) as range_values
FROM    (
        SELECT  value,
                NTILE(4) OVER (ORDER BY value) AS range,
                (
                SELECT  TOP 1 value
                FROM    [20090517_ranges].t_random li
                WHERE   li.value < lo.value
                ORDER BY
                        li.value DESC
                ) AS getprev,
                (
                SELECT  TOP 1 value
                FROM    &#91;20090517_ranges&#93;.t_random li
                WHERE   li.value > lo.value
                ORDER BY
                        li.value
                ) AS getnext
        FROM    [20090517_ranges].t_random lo
        ) vo
GROUP BY range
ORDER BY range

range range_start range_end range_values
1 10 242715 250
2 242715 491395 250
3 491395 744735 250
4 744735 998840 250
4 rows fetched in 0.0003s (0.0256s)
Table 't_random'. Scan count 2001, logical reads 4005, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 3, logical reads 2022, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 25 ms. 

We use a subquery here to select the next and the previous value along with each row. Since there is an index on value, these subqueries are very fast.

The previous value is aliased as getprev, the next one as getnext.

For each group then we select a minimal getprev along with minimal value. The minimal getprev is of course the greatest value in the previous group, while the minimal value is the least value in the current range.

Given this, we can use their average (counted as MIN(getprev) + MIN(value) / 2) as the lower bound for the current range.

Similarly, MAX(getnext) + MAX(value) / 2 will give us the upper bound for the current range.

Note that for each next row, previous getnext becomes the current value, and previos value becomes the current getprev.

That's why value + getnext / 2 for the last row in a range will become getprev + value / 2 for the first row in the next range, and the boundaries will always match.

Also there is a nice side effect: for the first row of the first range (that is one with NTILE = 1), getprev will be NULL and will not contribute to the aggregated value.

MIN(getprev) and MIN(value) will be equal for this range, and their average will just yield the first value, which will be the lower bound for the first range.

This is also true for the last value in the last range: it will be the upper bound of the last range.

Since there is an index on value and it is used both for subqueries and for grouping, this query is very efficient and runs for only 25 ms.

Written by Quassnoi

May 17th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply