EXPLAIN EXTENDED

How to create fast database queries

Analytic functions: optimizing SUM and ROW_NUMBER

Comments enabled. I *really* need your comment

In the previous articles I wrote about emulating numerous analytic function in MySQL.

Using methods described above, it's possible to emulate almost all analytic functions present in Oracle and SQL Server.

Here are these methods in a nutshell:

  • Select all table rows ordered by PARTITION BY columns, then by ORDER BY columns of the analytic function
  • Track the grouing sets by using session variables initialized in the first subquery
  • If the analytic function needs some precalculations to be evaluated (like, count of the rows in the grouping set, sum of the values etc), join the table with the precalculated aggregates
  • Use state session variables to calculate the analytic function and store intermediate values between rows
  • Initialize state session variables whenever the grouping set changes

This may sound confusing, but if you take a look on the examples from the previous articles, it will become clear as a bell.

This methods work and work well, if you need to select all rows from the tables.

But what if you need to implement some filtering? Do we really need to count millions of rows if we need first three? Do we really need to inspect all rows to find a maximum if we have an index?

Of course, no.

Analytic functions can be optimized as well as any other queries.

Let's begin with the simplest cast of SUM and ROW_NUMBER(). We'll try to emulate this query:

SELECT *
FROM    (
        SELECT  id, grouper, value,
                SUM(value) OVER (PARTITION BY grouper ORDER BY id),
                ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id) AS rn
        FROM t_aggregate
        )
WHERE rn BETWEEN 11 and 20

The query like this is often met on the paged reports. We need to select a partial sum for all columns from 11th to 20th for each grouping set.

For all these functions to work, we need to know the number and the values of the previous rows. But we don't need to know neither the number nor the values of the next rows.

So to optimize the query we need to limit the number of rows to be evaluated to the first 20 rows of each grouping set.

We can do it in the following steps:

  1. Create an index that includes all grouping columns and all ordering columns.
  2. Select DISTINCT grouping sets from the table in a subquery. This will use the index created above
  3. For each grouping set, find the 20th value in a correlated subquery. To find the 20th value, we'll need to use ORDER BY … LIMIT clause. This will also use the index.
  4. Join the main table so that the grouping_set = current_grouping_set AND ordering_set <= 20th_ordering_set. This will also use the index. This join will select 20 rows for each grouping set.
  5. Order the resulting rowset first by grouping columns, then by ordering columns.
  6. Use the session variables to calculate the ROW_NUMBER() and SUM(value)
  7. Filter out the first 10 values

The query:

SELECT  ao.*,
        @s AS `SUM(value) OVER (PARTITION BY grouper ORDER BY id)`,
        @r AS rn
FROM    (
        SELECT  @_grouper = -1,
                @r := 0,
                @s := 0
        ) vars,
        (
        SELECT  a.*
        FROM    (
                SELECT  grouper,
                        COALESCE(
                        (
                        SELECT  id
                        FROM    t_aggregator ami
                        WHERE   ami.grouper = ado2.grouper
                        ORDER BY
                                grouper, id
                        LIMIT   19, 1
                        ), CAST(0xFFFFFFFF AS DECIMAL)) AS maxid
                FROM    (
                        SELECT  DISTINCT grouper
                        FROM    t_aggregator ad
                        ) ado2
                ) ado,
                t_aggregator a
        WHERE   a.grouper >= ado.grouper
                AND a.grouper <= ado.grouper
                AND a.id <= ado.maxid
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND CASE WHEN @_grouper <> grouper THEN @s := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
        AND (@s := @s + value) IS NOT NULL
        AND (@r := @r + 1) > 10
id grouper value SUM(value) OVER (PARTITION BY grouper ORDER BY id) rn
21 1 92474 561092 11
23 1 99346 660438 12
25 1 33613 694051 13
27 1 21452 715503 14
29 1 95748 811251 15
31 1 50172 861423 16
33 1 3562 864985 17
35 1 92660 957645 18
37 1 80703 1038348 19
39 1 38944 1077292 20
22 2 13422 712157 11
24 2 39974 752131 12
26 2 69463 821594 13
28 2 69392 890986 14
30 2 39261 930247 15
32 2 80411 1010658 16
34 2 90383 1101041 17
36 2 20257 1121298 18
38 2 68201 1189499 19
40 2 29079 1218578 20
20 rows fetched in 0,0050s (0,0015s)

On a 1,000,000 rows table the query takes several microseconds to complete.

To be continued.

Written by Quassnoi

March 11th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply