Ranging random values
Comments enabled. I *really* need your comment
From Stack Overflow:
Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?
The ranges should not overlap and ideally have a similar amount of values in them.
This is a good job for SQL Server's analytic function NTILE
.
This function, given a parameter N
, assigns a rank from 1
to N
to each value, just as in quadrile (4), quintile (5), percentile (100) etc.
Lower 1/N
of values get a rank of 1
, from 1/N
to 2/N
get a rank of 2, etc.
The value, though, are integer and random, and we can get gaps between the adjacent ranges.
To avoid this, we need to extend the ranges a little, so each range gets a common bound with both upper and lower neighbor. This is best done by using the middle of the gap as a common range boundary.
Say, NTILE
assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.
Then we will use 205 as a common bound for the first two ranges, and 310 as a commond bound for the next two.
To do this, for each value selected, we will get also the previous one and the next one.
It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.
Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.
Let's create a sample table and see how it works:
SET NOCOUNT ON GO DROP TABLE [20090517_ranges].t_random DROP SCHEMA [20090517_ranges] GO CREATE SCHEMA [20090517_ranges] CREATE TABLE t_random ( id INT NOT NULL PRIMARY KEY, value INT NOT NULL ) GO CREATE INDEX IX_random_value ON [20090517_ranges].t_random (value DESC) GO DECLARE @cnt INT DECLARE @rand INT SELECT RAND(20090517) BEGIN TRANSACTION SET @cnt = 1 WHILE @cnt <= 1000 BEGIN SET @rand = FLOOR(RAND() * 100000) * 10 INSERT INTO [20090517_ranges].t_random (id, value) VALUES (@cnt, @rand) SET @cnt = @cnt + 1 END COMMIT GO [/sourcecode] The table has <strong>1,000</strong> records with values randomly distributed from <strong>0</strong> to <strong>999,990</strong>, all of them being multiples of <strong>10</strong>. Now, let's try to split these values into <strong>4</strong> ranges: SELECT range, (MIN(getprev) + MIN(value)) / 2 as range_start, (MAX(getnext) + MAX(value)) / 2 AS range_end, COUNT(*) as range_values FROM ( SELECT value, NTILE(4) OVER (ORDER BY value) AS range, ( SELECT TOP 1 value FROM [20090517_ranges].t_random li WHERE li.value < lo.value ORDER BY li.value DESC ) AS getprev, ( SELECT TOP 1 value FROM [20090517_ranges].t_random li WHERE li.value > lo.value ORDER BY li.value ) AS getnext FROM [20090517_ranges].t_random lo ) vo GROUP BY range ORDER BY range
range | range_start | range_end | range_values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10 | 242715 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 242715 | 491395 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | 491395 | 744735 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | 744735 | 998840 | 250 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 rows fetched in 0.0003s (0.0256s) |
Table 't_random'. Scan count 2001, logical reads 4005, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 3, logical reads 2022, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 25 ms.
We use a subquery here to select the next and the previous value along with each row. Since there is an index on value
, these subqueries are very fast.
The previous value is aliased as getprev
, the next one as getnext
.
For each group then we select a minimal getprev
along with minimal value
. The minimal getprev
is of course the greatest value
in the previous group, while the minimal value
is the least value in the current range.
Given this, we can use their average (counted as MIN(getprev) + MIN(value) / 2
) as the lower bound for the current range.
Similarly, MAX(getnext) + MAX(value) / 2
will give us the upper bound for the current range.
Note that for each next row, previous getnext
becomes the current value
, and previos value
becomes the current getprev
.
That's why value + getnext / 2
for the last row in a range will become getprev + value / 2
for the first row in the next range, and the boundaries will always match.
Also there is a nice side effect: for the first row of the first range (that is one with NTILE = 1
), getprev
will be NULL
and will not contribute to the aggregated value.
MIN(getprev)
and MIN(value)
will be equal for this range, and their average will just yield the first value
, which will be the lower bound for the first range.
This is also true for the last value in the last range: it will be the upper bound of the last range.
Since there is an index on value
and it is used both for subqueries and for grouping, this query is very efficient and runs for only 25 ms.