EXPLAIN EXTENDED

How to create fast database queries

Building alphabetical index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Cora asks:

I want to build an alphabetical index over the articles in my database.

I need to show 5 entries in the index, like this: A-D, E-F etc., so that each entry contains roughly equal number of articles.

How do I do it in a single query?

This is in Oracle

This is a good task to demonstrate Oracle's analytic abilities.

To do this, we will create a sample table:

Table creation details

This script creates 10,000 random articles with the names distributed logarithmically, according to ETAOIN SHRDLU.

Here's the distribution:

View the distribution of initial letters

To split the letters into 5 tomes (contiguous letter sets), we should calculate the aggregated sum of the articles starting with each letters, divide it by overall sum of them articles and distribute the letters according to this value.

This can be done with Oracle's ability to calculate partial sums and groupwise sums without actual grouping.

To calculate partial sums, we should use SUM(cnt) OVER (ORDER BY letter). This will calculate and accumulate the count of articles starting from each letter.

To calculate total sum and assign it to each row, we just use SUM(cnt) OVER(), without ORDER BY clause. This is an equivalent of an aggregate SUM, but is returned for each row.

Here are the results:

View the partial and total sums and their ratios

Now we should just take each value and assign it to the closest quotient of 5: 0.0, 0.2, 0.4 etc.

However, there is a problem.

The query above returns the following results for D and E:

LETTER CNT PARTIAL_SUM TOTAL_SUM PARTIAL_RATIO
D 333 1796 10000 0.180
E 1583 3379 10000 0.338

0.0 means tome 1, 0.2 means tome 2.

0.2 is inside the E range which is from 0.180 to 0.338.

Should we assign E to tome 1 or tome 2?

Assigning E to tome 1 makes it 0.338 articles long or 0.138 articles longer than it should be, while assigning to tome 2 makes it only 0.020 articles shorter.

It seems that the second option is more fair, since the difference is 0.020 vs. 0.138 in the first case.

So to calculate the tome the letter should go to, we should minimize the difference.

To do this, we should take the average of the previous and the current value of ratio (in this case, it would be (0.338 + 0.180) / 2 = 0.259) and round it down to the nearest quotient (0.2).

To calculate the difference between the rows, one could use LAG function. However, the analytic function cannot be nested and this would require nesting the whole query.

But there is a more elegant way.

Remember the algorithm the ratios are calculated with? They are the partial sums divided by the total sum.

The average of two consecutive ratios is the sum of all values but the current one plus the sum of all values (including the current one) divided by two and divided by the total sum:

fair_ratio = AVG(partial_ratio, previous_partial_ratio) = SUM(1:N-1) + SUM(1:N) / 2 * SUM()

But the sum of all values including the current one is in fact the sum of all preceding values plus the current value.

So to get the average of the ratios, we can just take the sum of all preceding values excluding the current one and add a half of the current value to it.

And Oracle allows us to calculate the sum of all values but the current one merely by adding the RANGE clause to the analytical function. This expression:

SUM(cnt) OVER (ORDER BY letter RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + cnt / 2

returns the average of the current and the previous partial sums by calculating the previous partial sum and adding half the current value to it.

Let's use it in a query:

View the query with fair ratios

We see that for letter E the fair_ratio is 0.259, which is the average of the partial_ratios from the preceding and the current rows:

LETTER CNT PARTIAL_SUM TOTAL_SUM PARTIAL_RATIO FAIR_RATIO
D 333 1796 10000 0.180 0.163
E 1583 3379 10000 0.338 0.259

Now we cal easily distribute the letters between the tomes.

To do this, we should just multiply the fair_ratio by 5 and round it down to the closest integer using FLOOR. This will give us the tome number.

The first and the last letters of each tome can be easily calculated using MIN and MAX.

Here's the query to do this:

SELECT  tome, MIN(letter) || '-' || MAX(letter) AS range, SUM(cnt)
FROM    (
        SELECT  letter, cnt,
                TRUNC((COALESCE(SUM(cnt) OVER (ORDER BY letter ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + cnt / 2) / SUM(cnt) OVER () * 5) + 1 AS tome
        FROM    (
                SELECT  SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt
                FROM    t_name
                GROUP BY
                        SUBSTR(name, 1, 1)
                )
        )
GROUP BY
        tome
ORDER BY
        tome
TOME RANGE SUM(CNT)
1 A-D 1796
2 E-H 2308
3 I-N 2025
4 O-S 2035
5 T-Z 1836

As you can see, the distribution is reasonably fair.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

September 23rd, 2009 at 11:00 pm

Posted in Oracle

Leave a Reply