## Ranging random values

Comments enabled. I *really* need your comment

From **Stack Overflow**:

Given a set of random numeric values in an

SQL Serverdatabase, how do I generate a limited list of ranges where each range contains at least one value?The ranges should not overlap and ideally have a similar amount of values in them.

This is a good job for **SQL Server**'s analytic function `NTILE`

.

This function, given a parameter `N`

, assigns a rank from `1`

to `N`

to each value, just as in *quadrile* (**4**), *quintile* (**5**), *percentile* (**100**) etc.

Lower `1/N`

of values get a rank of `1`

, from `1/N`

to `2/N`

get a rank of **2**, etc.

The value, though, are integer and random, and we can get gaps between the adjacent ranges.

To avoid this, we need to extend the ranges a little, so each range gets a common bound with both upper and lower neighbor. This is best done by using the middle of the gap as a common range boundary.

Say, `NTILE`

assigned rank **1** to values from **100** to **200**, rank **2** to values from **210** to **300**, and rank **3** to values from **320** to **400**.

Then we will use **205** as a common bound for the first two ranges, and **310** as a commond bound for the next two.

To do this, for each value selected, we will get also the previous one and the next one.

It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.

Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.

Let's create a sample table and see how it works:

SET NOCOUNT ON GO DROP TABLE [20090517_ranges].t_random DROP SCHEMA [20090517_ranges] GO CREATE SCHEMA [20090517_ranges] CREATE TABLE t_random ( id INT NOT NULL PRIMARY KEY, value INT NOT NULL ) GO CREATE INDEX IX_random_value ON [20090517_ranges].t_random (value DESC) GO DECLARE @cnt INT DECLARE @rand INT SELECT RAND(20090517) BEGIN TRANSACTION SET @cnt = 1 WHILE @cnt <= 1000 BEGIN SET @rand = FLOOR(RAND() * 100000) * 10 INSERT INTO [20090517_ranges].t_random (id, value) VALUES (@cnt, @rand) SET @cnt = @cnt + 1 END COMMIT GO [/sourcecode] The table has <strong>1,000</strong> records with values randomly distributed from <strong>0</strong> to <strong>999,990</strong>, all of them being multiples of <strong>10</strong>. Now, let's try to split these values into <strong>4</strong> ranges: SELECT range, (MIN(getprev) + MIN(value)) / 2 as range_start, (MAX(getnext) + MAX(value)) / 2 AS range_end, COUNT(*) as range_values FROM ( SELECT value, NTILE(4) OVER (ORDER BY value) AS range, ( SELECT TOP 1 value FROM [20090517_ranges].t_random li WHERE li.value < lo.value ORDER BY li.value DESC ) AS getprev, ( SELECT TOP 1 value FROM [20090517_ranges].t_random li WHERE li.value > lo.value ORDER BY li.value ) AS getnext FROM [20090517_ranges].t_random lo ) vo GROUP BY range ORDER BY range

range | range_start | range_end | range_values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 10 | 242715 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

2 | 242715 | 491395 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

3 | 491395 | 744735 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

4 | 744735 | 998840 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

4 rows fetched in 0.0003s (0.0256s) |

Table 't_random'. Scan count 2001, logical reads 4005, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 3, logical reads 2022, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 25 ms.

We use a subquery here to select the next and the previous value along with each row. Since there is an index on `value`

, these subqueries are very fast.

The previous value is aliased as `getprev`

, the next one as `getnext`

.

For each group then we select a minimal `getprev`

along with minimal `value`

. The minimal `getprev`

is of course the greatest `value`

in the previous group, while the minimal `value`

is the least value in the current range.

Given this, we can use their average (counted as `MIN(getprev) + MIN(value) / 2`

) as the lower bound for the current range.

Similarly, `MAX(getnext) + MAX(value) / 2`

will give us the upper bound for the current range.

Note that for each next row, previous `getnext`

becomes the current `value`

, and previos `value`

becomes the current `getprev`

.

That's why `value + getnext / 2`

for the last row in a range will become `getprev + value / 2`

for the first row in the next range, and the boundaries will always match.

Also there is a nice side effect: for the first row of the first range (that is one with `NTILE = 1`

), `getprev`

will be `NULL`

and will not contribute to the aggregated value.

`MIN(getprev)`

and `MIN(value)`

will be equal for this range, and their average will just yield the first `value`

, which will be the lower bound for the first range.

This is also true for the last value in the last range: it will be the upper bound of the last range.

Since there is an index on `value`

and it is used both for subqueries and for grouping, this query is very efficient and runs for only **25** ms.