# EXPLAIN EXTENDED

How to create fast database queries

## Ranging random values

From Stack Overflow:

Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?

The ranges should not overlap and ideally have a similar amount of values in them.

This is a good job for SQL Server's analytic function `NTILE`.

This function, given a parameter `N`, assigns a rank from `1` to `N` to each value, just as in quartile (4), quintile (5), percentile (100), etc.

The lowest `1/N` of the values get a rank of `1`, those from `1/N` to `2/N` get a rank of `2`, and so on.
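To make the tiling concrete, here is a minimal Python sketch of how `NTILE(N)` distributes sorted values into `N` nearly equal groups (the `ntile` helper is just an illustration, not SQL Server code; earlier groups absorb the extra rows when the count is not divisible by `N`):

```python
def ntile(values, n):
    """Return (value, rank) pairs with ranks 1..n, mimicking NTILE(n)."""
    ordered = sorted(values)
    # Each group gets `size` rows; the first `extra` groups get one more
    size, extra = divmod(len(ordered), n)
    result, start = [], 0
    for rank in range(1, n + 1):
        end = start + size + (1 if rank <= extra else 0)
        result.extend((v, rank) for v in ordered[start:end])
        start = end
    return result

print(ntile([10, 20, 30, 40, 50, 60, 70, 80], 4))
# → [(10, 1), (20, 1), (30, 2), (40, 2), (50, 3), (60, 3), (70, 4), (80, 4)]
```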

The values, though, are integer and random, so we can get gaps between adjacent ranges.

To avoid this, we need to extend the ranges a little, so each range gets a common bound with both upper and lower neighbor. This is best done by using the middle of the gap as a common range boundary.

Say, `NTILE` assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.

Then we will use 205 as the common bound for the first two ranges, and 310 as the common bound for the next two.
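In Python terms (and with the same integer division SQL Server will apply), those two shared bounds are just the midpoints of the gaps:

```python
# Midpoints of the gaps between the adjacent ranges from the example above
print((200 + 210) // 2)  # → 205: common bound between ranges 1 and 2
print((300 + 320) // 2)  # → 310: common bound between ranges 2 and 3
```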

To do this, for each value selected, we will also fetch the previous one and the next one.

It's obvious that the value that comes before the first value of any group is the last value of the previous group, and vice versa: the value next to the last value of a group is the first value of the next group.

Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.
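Putting these pieces together, here is a hedged Python sketch of the whole approach (a hypothetical `ranges` helper, not the article's SQL): tile the sorted values NTILE-style, take each group's neighboring values, and use the midpoint of each gap as a shared range boundary.

```python
def ranges(values, n):
    """Return (range, range_start, range_end, range_values) tuples."""
    ordered = sorted(values)
    # NTILE-style tiling: nearly equal group sizes
    size, extra = divmod(len(ordered), n)
    groups, start = [], 0
    for rank in range(n):
        end = start + size + (1 if rank < extra else 0)
        groups.append(ordered[start:end])
        start = end
    result = []
    for i, g in enumerate(groups):
        # getprev analogue: last value of the previous group
        prev_val = groups[i - 1][-1] if i > 0 else g[0]
        # getnext analogue: first value of the next group
        next_val = groups[i + 1][0] if i < n - 1 else g[-1]
        result.append((i + 1, (prev_val + g[0]) // 2,
                       (next_val + g[-1]) // 2, len(g)))
    return result

# The example above: three groups spanning 100-200, 210-300 and 320-400
print(ranges([100, 150, 200, 210, 250, 300, 320, 360, 400], 3))
# → [(1, 100, 205, 3), (2, 205, 310, 3), (3, 310, 400, 3)]
```

Note how each range's upper bound equals the next range's lower bound (205 and 310), which is exactly the property the SQL query below achieves with `getprev` and `getnext`.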

Let's create a sample table and see how it works:

```
SET NOCOUNT ON
GO
DROP TABLE [20090517_ranges].t_random
DROP SCHEMA [20090517_ranges]
GO
CREATE SCHEMA [20090517_ranges]
CREATE TABLE t_random (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL
)
GO
CREATE INDEX IX_random_value ON [20090517_ranges].t_random (value DESC)
GO
DECLARE @cnt INT
DECLARE @rand INT
SELECT  RAND(20090517)
BEGIN TRANSACTION
SET @cnt = 1
WHILE @cnt <= 1000
BEGIN
        SET @rand = FLOOR(RAND() * 100000) * 10
        INSERT
        INTO    [20090517_ranges].t_random (id, value)
        VALUES  (@cnt, @rand)
        SET @cnt = @cnt + 1
END
COMMIT
GO
```

The table has **1,000** records with values randomly distributed from **0** to **999,990**, all of them being multiples of **10**.

Now, let's try to split these values into **4** ranges:

```
SELECT  range,
        (MIN(getprev) + MIN(value)) / 2 AS range_start,
        (MAX(getnext) + MAX(value)) / 2 AS range_end,
        COUNT(*) AS range_values
FROM    (
        SELECT  value,
                NTILE(4) OVER (ORDER BY value) AS range,
                (
                SELECT  TOP 1 value
                FROM    [20090517_ranges].t_random li
                WHERE   li.value < lo.value
                ORDER BY
                        li.value DESC
                ) AS getprev,
                (
                SELECT  TOP 1 value
                FROM    [20090517_ranges].t_random li
                WHERE   li.value > lo.value
                ORDER BY
                        li.value
                ) AS getnext
        FROM    [20090517_ranges].t_random lo
        ) vo
GROUP BY range
ORDER BY range
```

```
range  range_start  range_end  range_values
1      10           242715     250
2      242715       491395     250
3      491395       744735     250
4      744735       998840     250

4 rows fetched in 0.0003s (0.0256s)
```

```
Table 't_random'. Scan count 2001, logical reads 4005, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 25 ms.
```

We use a subquery here to select the next and the previous value along with each row. Since there is an index on `value`, these subqueries are very fast.

The previous value is aliased as `getprev`, the next one as `getnext`.

For each group, we then select the minimal `getprev` along with the minimal `value`. The minimal `getprev` is of course the greatest `value` in the previous group, while the minimal `value` is the least value in the current range.

Given this, we can use their average, calculated as `(MIN(getprev) + MIN(value)) / 2`, as the lower bound for the current range.

Similarly, `(MAX(getnext) + MAX(value)) / 2` will give us the upper bound for the current range.

Note that for each subsequent row, the previous row's `getnext` becomes the current `value`, and the previous row's `value` becomes the current `getprev`.

That's why `(value + getnext) / 2` for the last row in a range will equal `(getprev + value) / 2` for the first row in the next range, and the boundaries will always match.

There is also a nice side effect: for the first row of the first range (that is, the one with `NTILE = 1`), `getprev` will be `NULL` and will not contribute to the aggregated value.

`MIN(getprev)` and `MIN(value)` will be equal for this range, and their average will just yield the first `value`, which will be the lower bound for the first range.

This is also true for the last value in the last range: it will be the upper bound of the last range.

Since there is an index on `value` and it is used both for the subqueries and for grouping, this query is very efficient and runs in only 25 ms.

Written by Quassnoi

May 17th, 2009 at 11:00 pm

Posted in SQL Server