EXPLAIN EXTENDED

How to create fast database queries

Rounding numbers preserving their sum

Comments enabled. I *really* need your comment

From Stack Overflow:

I'm looking for a way to round every value contained in a column without changing the total sum (which is known to be an integer, or very close to an integer because of float number precision).

Rounding every value to the nearest integer won't work (ex : (1.5, 1.5) will be rounded to either (1, 1) or (2, 2))

It's quite easy to do this using several requests (e. g. storing the original sum, rounding, computing the new sum, and updating as many rows as needed to go back to the original sum), but this is not a very elegant solution.

Is there a way to do this using a single SQL request?

I'm using SQL Server 2008, so solutions taking advantage of this specific vendor are welcome.

I'm looking for a request minimizing the differences between the old values and the new ones. In other words, a value should never be rounded up if a greater value has been rounded down, and vice-versa

We have two restrictions here:

  1. The sum of the rounded numbers should be equal to that of the unrounded ones.
  2. When rounding, no number with a greater fractional part can be rounded down if there is a number with a smaller fractional part that had been rounded up.

We cannot round (1.4, 1.4, 1.2) to (1, 1, 2), since 1.4 would be rounded down while 1.2 (with smaller fractional part) is rounded up. However, we can round them to (1, 2, 1) or (1, 2, 1), since nothing prevents us from rounding numbers with equal fractional part differently.

Given this, we should build a query that would take a set of numbers and round them taking these two restrictions into account.

It is easily seen than to satisfy the second restriction a certain rounding threshold should exist: all numbers with the fractional part greater than this threshold are rounded up, those with the fractional part smaller than then threshold are rounded down, those with the fractional part equal to the threshold can be rounded both ways.

However, calculating this threshold is, first, quite a difficult task and, second, we still don't know what to do with the numbers having the fractional part equal to the threshold.

Instead, we use another approach.

Let's create a sample tables and show this approach:


Table creation details

This script creates a table with 10 random values in it.

The values are multiples of 0.1 with random fractional parts (from 5.0 to 5.4) and their sum is exactly 52:

SELECT  value
FROM    [20090921_round].t_round
value
5.4
5.1
5.2
5.2
5.3
5.0
5.1
5.4
5.1
5.2

To round a value we should apply a rounding function to it which either truncates the fractional part (like FLOOR) or rounds up to the closest greater integer (like CEILING).

Now, what will we have if we apply FLOOR to all values?

SELECT  SUM(FLOOR(value))
FROM    [20090921_round].t_round
50

We see that truncating the fractional part makes the sum to be less by 2. That means that all fractional parts added make 2.

Now, what will be have if we round everything up?

SELECT  SUM(CEILING(value))
FROM    [20090921_round].t_round
59

This makes 59.

Note that we have 10 values in our table and the difference between 59 and 50 is only 9.

This is because there is no difference between the FLOOR and CEILING for the integer numbers (and only for them). If the number is not integer, the difference will be exactly 1.

And it's easy to see that indeed there is exactly 1 integer number in the table. All remaining 9 values are fractional and therefore contribute to the difference.

So we see that there are 9 numbers that affect the final sum being rounded differently. It is also easy to see that each value contribution to the difference amounts exactly to 1.

Now, the difference between SUM(FLOOR(value)) and SUM(value) is 2.

So if we take 2 of the fractional values and CEILING them, then take another 7 and FLOOR them, we will get exactly the original sum.

But which values should we floor and which should we ceil?

Of course we should order the values by their fractional parts in descending order (so that those with greater fractional parts would go first) and CEILING the top ones first. That will implicitly make the ceiling threshold I mentioned earlier and CEILING will be only applied only to the values exceeding the threshold (or, probably, equal to that).

To pick the right values we should utilize analytical function ROW_NUMBER():

First, we calculate the difference between SUM(value) and SUM(FLOOR(VALUE)). This gives us the number of values we should ceil rather than floor to get the correct sum. Let's denote this difference by N. Then we take first N values using ROW_NUMBER() and apply CEILING to them, while applying FLOOR to other.

Here's how do we do it in SQL Server:

SELECT  value,
CASE
WHEN ROW_NUMBER() OVER (ORDER BY value - FLOOR(value) DESC) <= SUM(value) OVER() - SUM(FLOOR(value)) OVER()
        THEN
                CEILING(value)
        ELSE
                FLOOR(value)
        END AS rvalue
FROM    [20090921_round].t_round
value rvalue
5.4 6
5.4 6
5.3 5
5.2 5
5.2 5
5.2 5
5.1 5
5.1 5
5.1 5
5.0 5

We see that CEILING is only applied to the first two values with greatest fractional parts. All other values are rounded down to the nearest smaller integer.

Now, let's check the sum of the values:

SELECT  SUM(rvalue) AS rsum
FROM    (
SELECT  CASE
WHEN ROW_NUMBER() OVER (ORDER BY value - FLOOR(value) DESC) <= SUM(value) OVER() - SUM(FLOOR(value)) OVER()
                THEN
                        CEILING(value)
                ELSE
                        FLOOR(value)
                END AS rvalue
        FROM    [20090921_round].t_round
        ) q
rsum
52

Just as in the original table.

Written by Quassnoi

September 21st, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply