Building alphabetical index
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Cora asks:
I want to build an alphabetical index over the articles in my database.
I need to show 5 entries in the index, like this:
A-D
,E-F
etc., so that each entry contains roughly equal number of articles.How do I do it in a single query?
This is in Oracle
This is a good task to demonstrate Oracle's analytic abilities.
To do this, we will create a sample table:
Table creation details
To split the letters into 5 tomes (contiguous letter sets), we should calculate the aggregated sum of the articles starting with each letters, divide it by overall sum of them articles and distribute the letters according to this value.
This can be done with Oracle's ability to calculate partial sums and groupwise sums without actual grouping.
To calculate partial sums, we should use SUM(cnt) OVER (ORDER BY letter)
. This will calculate and accumulate the count of articles starting from each letter.
To calculate total sum and assign it to each row, we just use SUM(cnt) OVER()
, without ORDER BY
clause. This is an equivalent of an aggregate SUM
, but is returned for each row.
Here are the results:
View the partial and total sums and their ratios
Now we should just take each value and assign it to the closest quotient of 5: 0.0
, 0.2
, 0.4
etc.
However, there is a problem.
The query above returns the following results for D and E:
LETTER | CNT | PARTIAL_SUM | TOTAL_SUM | PARTIAL_RATIO |
---|---|---|---|---|
D | 333 | 1796 | 10000 | 0.180 |
E | 1583 | 3379 | 10000 | 0.338 |
0.0
means tome 1, 0.2
means tome 2.
0.2
is inside the E range which is from 0.180
to 0.338
.
Should we assign E to tome 1 or tome 2?
Assigning E to tome 1 makes it 0.338
articles long or 0.138
articles longer than it should be, while assigning to tome 2 makes it only 0.020
articles shorter.
It seems that the second option is more fair, since the difference is 0.020
vs. 0.138
in the first case.
So to calculate the tome the letter should go to, we should minimize the difference.
To do this, we should take the average of the previous and the current value of ratio (in this case, it would be (0.338 + 0.180) / 2 = 0.259
) and round it down to the nearest quotient (0.2
).
To calculate the difference between the rows, one could use LAG
function. However, the analytic function cannot be nested and this would require nesting the whole query.
But there is a more elegant way.
Remember the algorithm the ratios are calculated with? They are the partial sums divided by the total sum.
The average of two consecutive ratios is the sum of all values but the current one plus the sum of all values (including the current one) divided by two and divided by the total sum:
fair_ratio = AVG(partial_ratio, previous_partial_ratio) = SUM(1:N-1) + SUM(1:N) / 2 * SUM()
But the sum of all values including the current one is in fact the sum of all preceding values plus the current value.
So to get the average of the ratios, we can just take the sum of all preceding values excluding the current one and add a half of the current value to it.
And Oracle allows us to calculate the sum of all values but the current one merely by adding the RANGE
clause to the analytical function. This expression:
SUM(cnt) OVER (ORDER BY letter RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + cnt / 2
returns the average of the current and the previous partial sums by calculating the previous partial sum and adding half the current value to it.
Let's use it in a query:
View the query with fair ratios
We see that for letter E the fair_ratio
is 0.259
, which is the average of the partial_ratios
from the preceding and the current rows:
LETTER | CNT | PARTIAL_SUM | TOTAL_SUM | PARTIAL_RATIO | FAIR_RATIO |
---|---|---|---|---|---|
D | 333 | 1796 | 10000 | 0.180 | 0.163 |
E | 1583 | 3379 | 10000 | 0.338 | 0.259 |
Now we cal easily distribute the letters between the tomes.
To do this, we should just multiply the fair_ratio
by 5 and round it down to the closest integer using FLOOR
. This will give us the tome number.
The first and the last letters of each tome can be easily calculated using MIN
and MAX
.
Here's the query to do this:
SELECT tome, MIN(letter) || '-' || MAX(letter) AS range, SUM(cnt) FROM ( SELECT letter, cnt, TRUNC((COALESCE(SUM(cnt) OVER (ORDER BY letter ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + cnt / 2) / SUM(cnt) OVER () * 5) + 1 AS tome FROM ( SELECT SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt FROM t_name GROUP BY SUBSTR(name, 1, 1) ) ) GROUP BY tome ORDER BY tome
TOME | RANGE | SUM(CNT) |
---|---|---|
1 | A-D | 1796 |
2 | E-H | 2308 |
3 | I-N | 2025 |
4 | O-S | 2035 |
5 | T-Z | 1836 |
As you can see, the distribution is reasonably fair.
Hope that helps.
I'm always glad to answer the questions regarding database queries.