Analytic functions: optimizing SUM and ROW_NUMBER

Comments enabled. I *really* need your comment

In the previous articles I wrote about emulating numerous analytic function in MySQL.

Using methods described above, it's possible to emulate almost all analytic functions present in Oracle and SQL Server.

Here are these methods in a nutshell:

Select all table rows ordered by PARTITION BY columns, then by ORDER BY columns of the analytic function
Track the grouing sets by using session variables initialized in the first subquery
If the analytic function needs some precalculations to be evaluated (like, count of the rows in the grouping set, sum of the values etc), join the table with the precalculated aggregates
Use state session variables to calculate the analytic function and store intermediate values between rows
Initialize state session variables whenever the grouping set changes

This may sound confusing, but if you take a look on the examples from the previous articles, it will become clear as a bell.

This methods work and work well, if you need to select all rows from the tables.

But what if you need to implement some filtering? Do we really need to count millions of rows if we need first three? Do we really need to inspect all rows to find a maximum if we have an index?

Of course, no.

Analytic functions can be optimized as well as any other queries.

Let's begin with the simplest cast of SUM and ROW_NUMBER(). We'll try to emulate this query:

SELECT *
FROM    (
SELECT  id, grouper, value,
SUM(value) OVER (PARTITION BY grouper ORDER BY id),
ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id) AS rn
FROM t_aggregate
)
WHERE rn BETWEEN 11 and 20

The query like this is often met on the paged reports. We need to select a partial sum for all columns from 11th to 20th for each grouping set.

For all these functions to work, we need to know the number and the values of the previous rows. But we don't need to know neither the number nor the values of the next rows.

So to optimize the query we need to limit the number of rows to be evaluated to the first 20 rows of each grouping set.

We can do it in the following steps:

Create an index that includes all grouping columns and all ordering columns.
Select DISTINCT grouping sets from the table in a subquery. This will use the index created above
For each grouping set, find the 20th value in a correlated subquery. To find the 20th value, we'll need to use ORDER BY … LIMIT clause. This will also use the index.
Join the main table so that the grouping_set = current_grouping_set AND ordering_set <= 20th_ordering_set. This will also use the index. This join will select 20 rows for each grouping set.
Order the resulting rowset first by grouping columns, then by ordering columns.
Use the session variables to calculate the ROW_NUMBER() and SUM(value)
Filter out the first 10 values

The query:

SELECT  ao.*,
@s AS `SUM(value) OVER (PARTITION BY grouper ORDER BY id)`,
@r AS rn
FROM    (
SELECT  @_grouper = -1,
@r := 0,
@s := 0
) vars,
(
SELECT  a.*
FROM    (
SELECT  grouper,
COALESCE(
(
SELECT  id
FROM    t_aggregator ami
WHERE   ami.grouper = ado2.grouper
ORDER BY
grouper, id
LIMIT   19, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS maxid
FROM    (
SELECT  DISTINCT grouper
FROM    t_aggregator ad
) ado2
) ado,
t_aggregator a
WHERE   a.grouper &gt;= ado.grouper
AND a.grouper &lt;= ado.grouper
                AND a.id &lt;= ado.maxid
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper &lt;&gt; grouper THEN @r := 0 ELSE 0 END IS NOT NULL
AND CASE WHEN @_grouper &lt;&gt; grouper THEN @s := 0 ELSE 0 END IS NOT NULL
AND (@_grouper := grouper) IS NOT NULL
AND (@s := @s + value) IS NOT NULL
AND (@r := @r + 1) &gt; 10

id	grouper	value	SUM(value) OVER (PARTITION BY grouper ORDER BY id)	rn
21	1	92474	561092	11
23	1	99346	660438	12
25	1	33613	694051	13
27	1	21452	715503	14
29	1	95748	811251	15
31	1	50172	861423	16
33	1	3562	864985	17
35	1	92660	957645	18
37	1	80703	1038348	19
39	1	38944	1077292	20
22	2	13422	712157	11
24	2	39974	752131	12
26	2	69463	821594	13
28	2	69392	890986	14
30	2	39261	930247	15
32	2	80411	1010658	16
34	2	90383	1101041	17
36	2	20257	1121298	18
38	2	68201	1189499	19
40	2	29079	1218578	20
20 rows fetched in 0,0050s (0,0015s)

On a 1,000,000 rows table the query takes several microseconds to complete.

To be continued.

Written by Quassnoi

March 11th, 2009 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED