# EXPLAIN EXTENDED

How to create fast database queries

## Building alphabetical index

I want to build an alphabetical index over the articles in my database.

I need to show 5 entries in the index, like this: `A-D`, `E-F` etc., so that each entry contains roughly equal number of articles.

How do I do it in a single query?

This is in Oracle

This is a good task to demonstrate Oracle's analytic abilities.

To do this, we will create a sample table:

Table creation details

To split the letters into 5 tomes (contiguous letter sets), we should calculate the aggregated sum of the articles starting with each letters, divide it by overall sum of them articles and distribute the letters according to this value.

This can be done with Oracle's ability to calculate partial sums and groupwise sums without actual grouping.

To calculate partial sums, we should use `SUM(cnt) OVER (ORDER BY letter)`. This will calculate and accumulate the count of articles starting from each letter.

To calculate total sum and assign it to each row, we just use `SUM(cnt) OVER()`, without `ORDER BY` clause. This is an equivalent of an aggregate `SUM`, but is returned for each row.

Here are the results:

Now we should just take each value and assign it to the closest quotient of 5: `0.0`, `0.2`, `0.4` etc.

However, there is a problem.

The query above returns the following results for D and E:

LETTER CNT PARTIAL_SUM TOTAL_SUM PARTIAL_RATIO
D 333 1796 10000 0.180
E 1583 3379 10000 0.338

`0.0` means tome 1, `0.2` means tome 2.

`0.2` is inside the E range which is from `0.180` to `0.338`.

Should we assign E to tome 1 or tome 2?

Assigning E to tome 1 makes it `0.338` articles long or `0.138` articles longer than it should be, while assigning to tome 2 makes it only `0.020` articles shorter.

It seems that the second option is more fair, since the difference is `0.020` vs. `0.138` in the first case.

So to calculate the tome the letter should go to, we should minimize the difference.

To do this, we should take the average of the previous and the current value of ratio (in this case, it would be `(0.338 + 0.180) / 2 = 0.259`) and round it down to the nearest quotient (`0.2`).

To calculate the difference between the rows, one could use `LAG` function. However, the analytic function cannot be nested and this would require nesting the whole query.

But there is a more elegant way.

Remember the algorithm the ratios are calculated with? They are the partial sums divided by the total sum.

The average of two consecutive ratios is the sum of all values but the current one plus the sum of all values (including the current one) divided by two and divided by the total sum:

`fair_ratio = AVG(partial_ratio, previous_partial_ratio) = SUM(1:N-1) + SUM(1:N) / 2 * SUM()`

But the sum of all values including the current one is in fact the sum of all preceding values plus the current value.

So to get the average of the ratios, we can just take the sum of all preceding values excluding the current one and add a half of the current value to it.

And Oracle allows us to calculate the sum of all values but the current one merely by adding the `RANGE` clause to the analytical function. This expression:

`SUM(cnt) OVER (ORDER BY letter RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + cnt / 2`

returns the average of the current and the previous partial sums by calculating the previous partial sum and adding half the current value to it.

Let's use it in a query:

We see that for letter E the `fair_ratio` is `0.259`, which is the average of the `partial_ratios` from the preceding and the current rows:

LETTER CNT PARTIAL_SUM TOTAL_SUM PARTIAL_RATIO FAIR_RATIO
D 333 1796 10000 0.180 0.163
E 1583 3379 10000 0.338 0.259

Now we cal easily distribute the letters between the tomes.

To do this, we should just multiply the `fair_ratio` by 5 and round it down to the closest integer using `FLOOR`. This will give us the tome number.

The first and the last letters of each tome can be easily calculated using `MIN` and `MAX`.

Here's the query to do this:

```SELECT  tome, MIN(letter) || '-' || MAX(letter) AS range, SUM(cnt)
FROM    (
SELECT  letter, cnt,
TRUNC((COALESCE(SUM(cnt) OVER (ORDER BY letter ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + cnt / 2) / SUM(cnt) OVER () * 5) + 1 AS tome
FROM    (
SELECT  SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt
FROM    t_name
GROUP BY
SUBSTR(name, 1, 1)
)
)
GROUP BY
tome
ORDER BY
tome
```
TOME RANGE SUM(CNT)
1 A-D 1796
2 E-H 2308
3 I-N 2025
4 O-S 2035
5 T-Z 1836

As you can see, the distribution is reasonably fair.

Hope that helps.