EXPLAIN EXTENDED

How to create fast database queries

Analytic functions: SUM, AVG, ROW_NUMBER

with 5 comments

In one of the previous articles I wrote about emulating some of analytic functions in MySQL.

Now, I'd like to cover this question more extensively.

A quick reminder: an analytic function is a function that behaves like an aggregate function with one exception: aggregate function returns one last row for each aggregated set, while an analytic function returns intermediate results too.

An analytic function can be made out of almost all aggregate functions by adding keyword OVER to them with two additional clauses: PARTITION BY and ORDER BY.

PARTITION BY is analog of GROUP BY. ORDER BY defines order in which the intermediate rows will be evaluated.

The behaviour of analytic functions can probably be best illustrated with an example:

SELECT *
FROM    t_aggregator
id grouper value
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 4
8 2 9
9 2 16
10 2 25

This is aggregate function SUM. It outputs a sum of all value's having the same grouper.

SELECT  grouper, SUM(value)
FROM    t_aggregator
GROUP BY
        grouper
grouper SUM(value)
1 15
2 55

We can convert it into analytic function SUM by adding keyword OVER:

SELECT  grouper, SUM(value) OVER (PARTITION BY grouper ORDER BY id)
FROM    t_aggregator
id grouper SUM(value) OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 3
3 1 6
4 1 10
5 1 15
6 2 1
7 2 5
8 2 14
9 2 30
10 2 55

As you can see, analytic SUM outputs intermediate results for each row: a sum of all value's from 1st to 1st row, then from 1st to 2nd, etc. The intermediate result is being reset as the grouper changes.

Values of the last row in each grouper are the sum of all value's from this grouper. It's exactly what aggregate SUM returns.

MySQL lacks these functions, so we need to emulate them somehow if we want to use their functionality.

Today I'll write about emulating three analytic functions: SUM, AVG and ROW_NUMBER(). ROW_NUMBER() is analytic analog of aggregate function COUNT(*). It has other name to be more descriptive, as it returns count of rows selected so far, which in fact is just a row number.

These function are simplest to emulate as their intermediate values are defined by (and only by) previous fields. In our example there are 5 rows for eacher grouper, but if you need a sum, average or count of the first two, you don't need to know other three.

To emulate the functions, we will need to use session variables to keep the intermediate data.

ROW_NUMBER() is probably the simplest of these functions to emulate, as it doesn't even need the column values:

SELECT  id, grouper, (@r := @r + 1) AS `ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)`
FROM    (
        SELECT  @_grouper := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                t_aggregator a
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
id grouper ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5

We use two session variables here: @_grouper and @r. @r is used to store intermediate data, while @_grouper is used to reset @r when we enter another group.

@_grouper is initialized to a value that cannot equal to any of real grouper's. I chose literal 'N' in this case, as grouper is declared as unsigned int.

@r is initialized to 0 in the WHERE clause each time a grouper changes, this is when the current value of grouper is not equal to previous (or initial) value of @_grouper. The @_grouper then is being set to the current value of grouper, so that it may be used for the same comparison on the next row.

In the SELECT clause of our query, the value of @r is being incremented and returned into the resultset. SELECT clause is evaluated after the WHERE clause.

To emulate SUM, we will need change the query above just a little:

SELECT  id, grouper, (@r := @r + value) AS `SUM(value) OVER (PARTITION BY grouper ORDER BY id)`
FROM    (
        SELECT  @_grouper := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                t_aggregator a
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
id grouper SUM(value) OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 3
3 1 6
4 1 10
5 1 15
6 2 1
7 2 5
8 2 14
9 2 30
10 2 55

All that we changed here is an expression in the SELECT clause: each time we add value to @r instead of constant 1.

AVG is a bit more complicated:

SELECT  id, grouper, (@r := @r + `value`) / (@s := @s + 1) AS `AVG(value) OVER (PARTITION BY grouper ORDER BY id)`
FROM    (
        SELECT  @_grouper := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                t_aggregator a
        ORDER BY
                grouper, id
        ) ao
WHERE   CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL
        AND CASE WHEN @_grouper <> grouper THEN @s := 0 ELSE 0 END IS NOT NULL
        AND (@_grouper := grouper) IS NOT NULL
id grouper AVG(value) OVER (PARTITION BY grouper ORDER BY id)
1 1 1
2 1 1.5
3 1 2
4 1 2.5
5 1 3
6 2 1
7 2 2.5
8 2 4.6666666666667
9 2 7.5
10 2 11

AVG, in fact, is SUM() / ROW_NUMBER(). That's why we keep two variables here: @r to hold SUM() and @s to hold ROW_NUMBER(), and initialize them in two expressions in the WHERE clause.

In all the cases above, we emulated analytic functions with one column in PARTITION BY clause and one column in ORDER BY clause. If you need additional columns, you do the following:

SELECT  to.*,
            (@r := @r + `value`) / (@s := @s + 1) AS
            `AVG(value) OVER (PARTITION BY column_p_1, …, column_p_n ORDER BY column_o_1, …, column_o_m)`
FROM    (
        SELECT  @_column_p_1 := 'N',
                  …,
               @_column_p_n := 'N'
        ) vars,
        (
        SELECT  *
        FROM
                table t
        ORDER BY
                column_p_1, …, column_p_n, column_o_1, …, column_o_m
       ) to
WHERE   CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @r := 0 ELSE 0 END IS NOT NULL
        AND CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @s := 0 ELSE 0 END IS NOT NULL
        AND (@_column_p_1 := column_p_1) IS NOT NULL
        …
        AND (@_column_p_n := column_p_n) IS NOT NULL

In the next article, I'll cover some of more complicated analytic functions.

To be continued.

Written by Quassnoi

March 8th, 2009 at 11:00 pm

Posted in MySQL

5 Responses to 'Analytic functions: SUM, AVG, ROW_NUMBER'

Subscribe to comments with RSS

  1. Hello
    Firstly, thanks for posting these tips.

    I have been trying for a while now to get the ROW_NUMBER() function as you have described it here to work, but to no avail. I created a table just like yours and have the code just like yours but the “ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)” column simply increments, that is, it doesn’t reset to 0 after the grouper value changes from 1 to 2. I have found an alternative solution to my problem but would like to see the code you have listed here working. Below is what I have

    Thank You!

    Select customer,orderdate, (@r:=@r+1) as ‘Row Num’
    FROM
    (select @_grouper := ‘N’) X,
    (select * from test_cust t order by customer) Y
    WHERE
    CASE WHEN @_grouper customer THEN @r := 0 ELSE 0 END IS NOT NULL
    AND (@_grouper := customer) IS NOT NULL

    Raj

    4 Jan 16 at 03:03

  2. Hello. I’m using aurora reds (equivalent of mysql) I’ve applied your code to my case but it’s not working. I’m getting an error around the partition area. Is it possible to get in contact with you through email or by phone to troubleshoot this further. I cannot recreate my case here because it would be too much of a job in itself and I can’t post my real data for obvious reasons. Thanks

    FAR

    17 Feb 16 at 00:46

  3. @FAR: sure, use the feedback form: https://explainextended.com/drop-me-a-line

    Quassnoi

    17 Feb 16 at 00:52

  4. Nice article. Wondering if there is a way to have a column alias for the line:

    @r := @r + 1) AS `ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)`

    Regards,

    Ram

    Ram

    24 Jul 17 at 20:48

  5. Nice article . It help in my project.

    swapnil kambe

    19 Sep 18 at 13:40

Leave a Reply