## Building alphabetical index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

**Cora** asks:

I want to build an alphabetical index over the articles in my database.

I need to show

5entries in the index, like this:`A-D`

,`E-F`

etc., so that each entry contains roughly equal number of articles.How do I do it in a single query?

This is in

Oracle

This is a good task to demonstrate **Oracle**'s analytic abilities.

To do this, we will create a sample table:

**Table creation details**

This script creates **10,000** random articles with the names distributed logarithmically, according to ETAOIN SHRDLU.

Here's the distribution:

**View the distribution of initial letters**

To split the letters into **5** *tomes* (contiguous letter sets), we should calculate the aggregated sum of the articles starting with each letters, divide it by overall sum of them articles and distribute the letters according to this value.

This can be done with **Oracle**'s ability to calculate partial sums and groupwise sums without actual grouping.

To calculate partial sums, we should use `SUM(cnt) OVER (ORDER BY letter)`

. This will calculate and accumulate the count of articles starting from each letter.

To calculate total sum and assign it to each row, we just use `SUM(cnt) OVER()`

, without `ORDER BY`

clause. This is an equivalent of an aggregate `SUM`

, but is returned for each row.

Here are the results:

**View the partial and total sums and their ratios**

Now we should just take each value and assign it to the closest quotient of **5**: `0.0`

, `0.2`

, `0.4`

etc.

However, there is a problem.

The query above returns the following results for **D** and **E**:

LETTER | CNT | PARTIAL_SUM | TOTAL_SUM | PARTIAL_RATIO |
---|---|---|---|---|

D | 333 | 1796 | 10000 | 0.180 |

E | 1583 | 3379 | 10000 | 0.338 |

`0.0`

means tome **1**, `0.2`

means tome **2**.

`0.2`

is inside the **E** range which is from `0.180`

to `0.338`

.

Should we assign **E** to tome **1** or tome **2**?

Assigning **E** to tome **1** makes it `0.338`

articles long or `0.138`

articles longer than it should be, while assigning to tome **2** makes it only `0.020`

articles shorter.

It seems that the second option is more fair, since the difference is `0.020`

vs. `0.138`

in the first case.

So to calculate the tome the letter should go to, we should minimize the difference.

To do this, we should take the average of the previous and the current value of ratio (in this case, it would be `(0.338 + 0.180) / 2 = 0.259`

) and round it down to the nearest quotient (`0.2`

).

To calculate the difference between the rows, one could use `LAG`

function. However, the analytic function cannot be nested and this would require nesting the whole query.

But there is a more elegant way.

Remember the algorithm the ratios are calculated with? They are the **partial sums** divided by the **total sum**.

The average of two consecutive ratios is the sum of all values but the current one plus the sum of all values (including the current one) divided by two and divided by the total sum:

`fair_ratio = AVG(partial_ratio, previous_partial_ratio) = SUM(1:N-1) + SUM(1:N) / 2 * SUM()`

But the sum of all values including the current one is in fact the sum of all preceding values plus the current value.

So to get the average of the ratios, we can just take the sum of all preceding values **excluding** the current one and add a *half* of the current value to it.

And **Oracle** allows us to calculate the sum of all values but the current one merely by adding the `RANGE`

clause to the analytical function. This expression:

`SUM(cnt) OVER (ORDER BY letter RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + cnt / 2`

returns the average of the current and the previous partial sums by calculating the previous partial sum and adding half the current value to it.

Let's use it in a query:

**View the query with fair ratios**

We see that for letter **E** the `fair_ratio`

is `0.259`

, which is the average of the `partial_ratios`

from the preceding and the current rows:

LETTER | CNT | PARTIAL_SUM | TOTAL_SUM | PARTIAL_RATIO | FAIR_RATIO |
---|---|---|---|---|---|

D | 333 | 1796 | 10000 | 0.180 | 0.163 |

E | 1583 | 3379 | 10000 | 0.338 | 0.259 |

Now we cal easily distribute the letters between the tomes.

To do this, we should just multiply the `fair_ratio`

by **5** and round it down to the closest integer using `FLOOR`

. This will give us the tome number.

The first and the last letters of each tome can be easily calculated using `MIN`

and `MAX`

.

Here's the query to do this:

SELECT tome, MIN(letter) || '-' || MAX(letter) AS range, SUM(cnt) FROM ( SELECT letter, cnt, TRUNC((COALESCE(SUM(cnt) OVER (ORDER BY letter ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + cnt / 2) / SUM(cnt) OVER () * 5) + 1 AS tome FROM ( SELECT SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt FROM t_name GROUP BY SUBSTR(name, 1, 1) ) ) GROUP BY tome ORDER BY tome

TOME | RANGE | SUM(CNT) |
---|---|---|

1 | A-D | 1796 |

2 | E-H | 2308 |

3 | I-N | 2025 |

4 | O-S | 2035 |

5 | T-Z | 1836 |

As you can see, the distribution is reasonably fair.

Hope that helps.

I'm always glad to answer the questions regarding database queries.