EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: 5 Claims About SQL, Explained. You're welcome to read and comment on it.

Analytic functions: SUM, AVG, ROW_NUMBER

Comments enabled. I *really* need your comment

In one of the previous articles I wrote about emulating some of analytic functions in MySQL.

Now, I'd like to cover this question more extensively.

A quick reminder: an analytic function is a function that behaves like an aggregate function with one exception: aggregate function returns one last row for each aggregated set, while an analytic function returns intermediate results too.

An analytic function can be made out of almost all aggregate functions by adding keyword OVER to them with two additional clauses: PARTITION BY and ORDER BY.

PARTITION BY is analog of GROUP BY. ORDER BY defines order in which the intermediate rows will be evaluated.

The behaviour of analytic functions can probably be best illustrated with an example:

SELECT *
FROM    t_aggregator
id grouper value
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 4
8 2 9
9 2 16
10 2 25

This is aggregate function SUM. It outputs a sum of all value's having the same grouper.

SELECT  grouper, SUM(value)
FROM    t_aggregator
GROUP BY
        grouper
grouper SUM(value)
1 15
2 55

We can convert it into analytic function SUM by adding keyword OVER:

SELECT  grouper, SUM(value) OVER (PARTITION BY grouper ORDER BY id)
FROM    t_aggregator
id grouper SUM(value) OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 3
3 1 6
4 1 10
5 1 15
6 2 1
7 2 5
8 2 14
9 2 30
10 2 55

As you can see, analytic SUM outputs intermediate results for each row: a sum of all value's from 1st to 1st row, then from 1st to 2nd, etc. The intermediate result is being reset as the grouper changes.

Values of the last row in each grouper are the sum of all value's from this grouper. It's exactly what aggregate SUM returns.

MySQL lacks these functions, so we need to emulate them somehow if we want to use their functionality.

Today I'll write about emulating three analytic functions: SUM, AVG and ROW_NUMBER(). ROW_NUMBER() is analytic analog of aggregate function COUNT(*). It has other name to be more descriptive, as it returns count of rows selected so far, which in fact is just a row number.

These function are simplest to emulate as their intermediate values are defined by (and only by) previous fields. In our example there are 5 rows for eacher grouper, but if you need a sum, average or count of the first two, you don't need to know other three.

To emulate the functions, we will need to use session variables to keep the intermediate data.

ROW_NUMBER() is probably the simplest of these functions to emulate, as it doesn't even need the column values:

SELECT  id, grouper, (@r := @r + 1) AS `ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)`
FROM    (
        SELECT  @_grouper := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                t_aggregator a
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
id grouper ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5

We use two session variables here: @_grouper and @r. @r is used to store intermediate data, while @_grouper is used to reset @r when we enter another group.

@_grouper is initialized to a value that cannot equal to any of real grouper's. I chose literal 'N' in this case, as grouper is declared as unsigned int.

@r is initialized to 0 in the WHERE clause each time a grouper changes, this is when the current value of grouper is not equal to previous (or initial) value of @_grouper. The @_grouper then is being set to the current value of grouper, so that it may be used for the same comparison on the next row.

In the SELECT clause of our query, the value of @r is being incremented and returned into the resultset. SELECT clause is evaluated after the WHERE clause.

To emulate SUM, we will need change the query above just a little:

SELECT  id, grouper, (@r := @r + value) AS `SUM(value) OVER (PARTITION BY grouper ORDER BY id)`
FROM    (
        SELECT  @_grouper := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                t_aggregator a
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
id grouper SUM(value) OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 3
3 1 6
4 1 10
5 1 15
6 2 1
7 2 5
8 2 14
9 2 30
10 2 55

All that we changed here is an expression in the SELECT clause: each time we add value to @r instead of constant 1.

AVG is a bit more complicated:

SELECT  id, grouper, (@r := @r + `value`) / (@s := @s + 1) AS `AVG(value) OVER (PARTITION BY grouper ORDER BY id)`
FROM    (
        SELECT  @_grouper := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                t_aggregator a
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND CASE WHEN @_grouper <> grouper THEN @s := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
id grouper AVG(value) OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 1.5
3 1 2
4 1 2.5
5 1 3
6 2 1
7 2 2.5
8 2 4.6666666666667
9 2 7.5
10 2 11

AVG, in fact, is SUM() / ROW_NUMBER(). That's why we keep two variables here: @r to hold SUM() and @s to hold ROW_NUMBER(), and initialize them in two expressions in the WHERE clause.

In all the cases above, we emulated analytic functions with one column in PARTITION BY clause and one column in ORDER BY clause. If you need additional columns, you do the following:

SELECT  to.*,
            (@r := @r + `value`) / (@s := @s + 1) AS
            `AVG(value) OVER (PARTITION BY column_p_1, …, column_p_n ORDER BY column_o_1, …, column_o_m)`
FROM    (
        SELECT  @_column_p_1 := 'N',
                  …,
               @_column_p_n := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                table t
        ORDER BY
                column_p_1, …, column_p_n, column_o_1, …, column_o_m
       ) to
WHERE   CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @r := 0 ELSE 0 END IS NOT NULL
        AND CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @s := 0 ELSE 0 END IS NOT NULL
        AND (@_column_p_1 := column_p_1) IS NOT NULL
        …
        AND (@_column_p_n := column_p_n) IS NOT NULL

In the next article, I'll cover some of more complicated analytic functions.

To be continued.

Written by Quassnoi

March 8th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply