Building alphabetical index at EXPLAIN EXTENDED

Building alphabetical index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Cora asks:

I want to build an alphabetical index over the articles in my database.

I need to show 5 entries in the index, like this: A-D, E-F etc., so that each entry contains roughly equal number of articles.

How do I do it in a single query?

This is in Oracle

This is a good task to demonstrate Oracle's analytic abilities.

To do this, we will create a sample table:

Table creation details

BEGIN
        DBMS_RANDOM.seed(20090923);
END;
/
CREATE TABLE t_name (id NOT NULL PRIMARY KEY, name NOT NULL)
AS
SELECT  level, CAST(SUBSTR('ETAOINSHRDLUCMFWYPVBGKQJXZ', 1 + FLOOR(LOG(0.85, POWER(0.85, 26) + DBMS_RANDOM.value() * (1 - POWER(0.85, 26)))), 1) || ' article ' || level AS VARCHAR(50)) AS name
FROM    dual
CONNECT BY
        level <= 10000
/
CREATE INDEX ix_name_name ON t_name (name)
/
&#91;/sourcecode&#93;
</div>

This script creates <strong>10,000</strong> random articles with the names distributed logarithmically, according to <a href="http://en.wikipedia.org/wiki/ETAOIN_SHRDLU">ETAOIN SHRDLU</a>.

Here's the distribution:

<a href="#" onclick="xcollapse('X3539');return false;"><strong>View the distribution of initial letters</strong></a>
<br />
<div id="X3539" style="display: none; ">

SELECT  SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt
FROM    t_name
GROUP BY
        SUBSTR(name, 1, 1)
ORDER BY
        cnt DESC

LETTER	CNT
E	1583
T	1213
A	1149
O	908
I	757
N	679
S	570
H	505
R	423
D	333
L	316
U	264
C	236
M	180
F	166
W	122
Y	117
P	98
B	78
V	65
G	54
K	52
J	41
Q	36
X	32
Z	23

To split the letters into 5 tomes (contiguous letter sets), we should calculate the aggregated sum of the articles starting with each letters, divide it by overall sum of them articles and distribute the letters according to this value.

This can be done with Oracle's ability to calculate partial sums and groupwise sums without actual grouping.

To calculate partial sums, we should use SUM(cnt) OVER (ORDER BY letter). This will calculate and accumulate the count of articles starting from each letter.

To calculate total sum and assign it to each row, we just use SUM(cnt) OVER(), without ORDER BY clause. This is an equivalent of an aggregate SUM, but is returned for each row.

Here are the results:

View the partial and total sums and their ratios

SELECT  letter, cnt,
        SUM(cnt) OVER (ORDER BY letter) AS partial_sum,
        SUM(cnt) OVER () AS total_sum,
        TO_CHAR(SUM(cnt) OVER (ORDER BY letter) / SUM(cnt) OVER (), 'FM0.990') AS partial_ratio
FROM    (
        SELECT  SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt
        FROM    t_name
        GROUP BY
                SUBSTR(name, 1, 1)
        )

LETTER	CNT	PARTIAL_SUM	TOTAL_SUM	PARTIAL_RATIO
A	1149	1149	10000	0.115
B	78	1227	10000	0.123
C	236	1463	10000	0.146
D	333	1796	10000	0.180
E	1583	3379	10000	0.338
F	166	3545	10000	0.355
G	54	3599	10000	0.360
H	505	4104	10000	0.410
I	757	4861	10000	0.486
J	41	4902	10000	0.490
K	52	4954	10000	0.495
L	316	5270	10000	0.527
M	180	5450	10000	0.545
N	679	6129	10000	0.613
O	908	7037	10000	0.704
P	98	7135	10000	0.714
Q	36	7171	10000	0.717
R	423	7594	10000	0.759
S	570	8164	10000	0.816
T	1213	9377	10000	0.938
U	264	9641	10000	0.964
V	65	9706	10000	0.971
W	122	9828	10000	0.983
X	32	9860	10000	0.986
Y	117	9977	10000	0.998
Z	23	10000	10000	1.000

Now we should just take each value and assign it to the closest quotient of 5: 0.0, 0.2, 0.4 etc.

However, there is a problem.

The query above returns the following results for D and E:

LETTER	CNT	PARTIAL_SUM	TOTAL_SUM	PARTIAL_RATIO
D	333	1796	10000	0.180
E	1583	3379	10000	0.338

0.0 means tome 1, 0.2 means tome 2.

0.2 is inside the E range which is from 0.180 to 0.338.

Should we assign E to tome 1 or tome 2?

Assigning E to tome 1 makes it 0.338 articles long or 0.138 articles longer than it should be, while assigning to tome 2 makes it only 0.020 articles shorter.

It seems that the second option is more fair, since the difference is 0.020 vs. 0.138 in the first case.

So to calculate the tome the letter should go to, we should minimize the difference.

To do this, we should take the average of the previous and the current value of ratio (in this case, it would be (0.338 + 0.180) / 2 = 0.259) and round it down to the nearest quotient (0.2).

To calculate the difference between the rows, one could use LAG function. However, the analytic function cannot be nested and this would require nesting the whole query.

But there is a more elegant way.

Remember the algorithm the ratios are calculated with? They are the partial sums divided by the total sum.

The average of two consecutive ratios is the sum of all values but the current one plus the sum of all values (including the current one) divided by two and divided by the total sum:

fair_ratio = AVG(partial_ratio, previous_partial_ratio) = SUM(1:N-1) + SUM(1:N) / 2 * SUM()

But the sum of all values including the current one is in fact the sum of all preceding values plus the current value.

So to get the average of the ratios, we can just take the sum of all preceding values excluding the current one and add a half of the current value to it.

And Oracle allows us to calculate the sum of all values but the current one merely by adding the RANGE clause to the analytical function. This expression:

SUM(cnt) OVER (ORDER BY letter RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + cnt / 2

returns the average of the current and the previous partial sums by calculating the previous partial sum and adding half the current value to it.

Let's use it in a query:

View the query with fair ratios

SELECT  letter, cnt,
        SUM(cnt) OVER (ORDER BY letter) AS partial_sum,
        SUM(cnt) OVER () AS total_sum,
        TO_CHAR(SUM(cnt) OVER (ORDER BY letter) / SUM(cnt) OVER (), 'FM0.990') AS partial_ratio,
        TO_CHAR((COALESCE(SUM(cnt) OVER (ORDER BY letter ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + cnt / 2) / SUM(cnt) OVER (), 'FM0.990') AS fair_ratio
FROM    (
        SELECT  SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt
        FROM    t_name
        GROUP BY
                SUBSTR(name, 1, 1)
        )

LETTER	CNT	PARTIAL_SUM	TOTAL_SUM	PARTIAL_RATIO	FAIR_RATIO
A	1149	1149	10000	0.115	0.057
B	78	1227	10000	0.123	0.119
C	236	1463	10000	0.146	0.135
D	333	1796	10000	0.180	0.163
E	1583	3379	10000	0.338	0.259
F	166	3545	10000	0.355	0.346
G	54	3599	10000	0.360	0.357
H	505	4104	10000	0.410	0.385
I	757	4861	10000	0.486	0.448
J	41	4902	10000	0.490	0.488
K	52	4954	10000	0.495	0.493
L	316	5270	10000	0.527	0.511
M	180	5450	10000	0.545	0.536
N	679	6129	10000	0.613	0.579
O	908	7037	10000	0.704	0.658
P	98	7135	10000	0.714	0.709
Q	36	7171	10000	0.717	0.715
R	423	7594	10000	0.759	0.738
S	570	8164	10000	0.816	0.788
T	1213	9377	10000	0.938	0.877
U	264	9641	10000	0.964	0.951
V	65	9706	10000	0.971	0.967
W	122	9828	10000	0.983	0.977
X	32	9860	10000	0.986	0.984
Y	117	9977	10000	0.998	0.992
Z	23	10000	10000	1.000	0.999

We see that for letter E the fair_ratio is 0.259, which is the average of the partial_ratios from the preceding and the current rows:

LETTER	CNT	PARTIAL_SUM	TOTAL_SUM	PARTIAL_RATIO	FAIR_RATIO
D	333	1796	10000	0.180	0.163
E	1583	3379	10000	0.338	0.259

Now we cal easily distribute the letters between the tomes.

To do this, we should just multiply the fair_ratio by 5 and round it down to the closest integer using FLOOR. This will give us the tome number.

The first and the last letters of each tome can be easily calculated using MIN and MAX.

Here's the query to do this:

SELECT  tome, MIN(letter) || '-' || MAX(letter) AS range, SUM(cnt)
FROM    (
        SELECT  letter, cnt,
                TRUNC((COALESCE(SUM(cnt) OVER (ORDER BY letter ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + cnt / 2) / SUM(cnt) OVER () * 5) + 1 AS tome
        FROM    (
                SELECT  SUBSTR(name, 1, 1) AS letter, COUNT(*) AS cnt
                FROM    t_name
                GROUP BY
                        SUBSTR(name, 1, 1)
                )
        )
GROUP BY
        tome
ORDER BY
        tome

TOME	RANGE	SUM(CNT)
1	A-D	1796
2	E-H	2308
3	I-N	2025
4	O-S	2035
5	T-Z	1836

As you can see, the distribution is reasonably fair.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

September 23rd, 2009 at 11:00 pm

Posted in Oracle

EXPLAIN EXTENDED