## Analytic functions: SUM, AVG, ROW_NUMBER

Comments enabled. I *really* need your comment

In one of the previous articles I wrote about emulating some of analytic functions in **MySQL**.

Now, I'd like to cover this question more extensively.

A quick reminder: an **analytic function** is a function that behaves like an **aggregate function** with one exception: **aggregate function** returns one last row for each aggregated set, while an **analytic function** returns intermediate results too.

An **analytic function** can be made out of almost all **aggregate functions** by adding keyword `OVER`

to them with two additional clauses: `PARTITION BY`

and `ORDER BY`

.

`PARTITION BY`

is analog of `GROUP BY`

. `ORDER BY`

defines order in which the intermediate rows will be evaluated.

The behaviour of analytic functions can probably be best illustrated with an example:

SELECT * FROM t_aggregator

id | grouper | value |
---|---|---|

1 | 1 | 1 |

2 | 1 | 2 |

3 | 1 | 3 |

4 | 1 | 4 |

5 | 1 | 5 |

6 | 2 | 1 |

7 | 2 | 4 |

8 | 2 | 9 |

9 | 2 | 16 |

10 | 2 | 25 |

This is aggregate function `SUM`

. It outputs a sum of all `value`

's having the same `grouper`

.

SELECT grouper, SUM(value) FROM t_aggregator GROUP BY grouper

grouper | SUM(value) |
---|---|

1 | 15 |

2 | 55 |

We can convert it into analytic function `SUM`

by adding keyword `OVER`

:

SELECT grouper, SUM(value) OVER (PARTITION BY grouper ORDER BY id) FROM t_aggregator

id | grouper | SUM(value) OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|

1 | 1 | 1 |

2 | 1 | 3 |

3 | 1 | 6 |

4 | 1 | 10 |

5 | 1 | 15 |

6 | 2 | 1 |

7 | 2 | 5 |

8 | 2 | 14 |

9 | 2 | 30 |

10 | 2 | 55 |

As you can see, analytic `SUM`

outputs intermediate results for each row: a sum of all `value`

's from **1st** to **1st** row, then from **1st** to **2nd**, etc. The intermediate result is being reset

as the `grouper`

changes.

Values of the last row in each `grouper`

are the sum of all `value`

's from this `grouper`

. It's exactly what aggregate `SUM`

returns.

**MySQL** lacks these functions, so we need to emulate them somehow if we want to use their functionality.

Today I'll write about emulating three analytic functions: `SUM`

, `AVG`

and `ROW_NUMBER()`

. `ROW_NUMBER()`

is analytic analog of aggregate function `COUNT(*)`

. It has other name to be more descriptive, as it returns count of rows selected so far

, which in fact is just a row number.

These function are simplest to emulate as their intermediate values are defined by (and only by) previous fields. In our example there are 5 rows for eacher grouper, but if you need a sum, average or count of the first two, you don't need to know other three.

To emulate the functions, we will need to use session variables to keep the intermediate data.

`ROW_NUMBER()`

is probably the simplest of these functions to emulate, as it doesn't even need the column values:

SELECT id, grouper, (@r := @r + 1) AS `ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)` FROM ( SELECT @_grouper := 'N' ) vars, ( SELECT * FROM t_aggregator a ORDER BY grouper, id ) ao WHERE CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL AND (@_grouper := grouper) IS NOT NULL

id | grouper | ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|

1 | 1 | 1 |

2 | 1 | 2 |

3 | 1 | 3 |

4 | 1 | 4 |

5 | 1 | 5 |

6 | 2 | 1 |

7 | 2 | 2 |

8 | 2 | 3 |

9 | 2 | 4 |

10 | 2 | 5 |

We use two session variables here: `@_grouper`

and `@r`

. `@r`

is used to store intermediate data, while `@_grouper`

is used to reset `@r`

when we enter another group.

`@_grouper`

is initialized to a value that cannot equal to any of real `grouper`

's. I chose literal **'N'** in this case, as `grouper`

is declared as `unsigned int`

.

@r is initialized to `0`

in the `WHERE`

clause each time a grouper changes, this is when the current value of `grouper`

is not equal to previous (or initial) value of `@_grouper`

. The `@_grouper`

then is being set to the current value of grouper, so that it may be used for the same comparison on the next row.

In the `SELECT`

clause of our query, the value of `@r`

is being incremented and returned into the resultset. `SELECT`

clause is evaluated after the `WHERE`

clause.

To emulate `SUM`

, we will need change the query above just a little:

SELECT id, grouper, (@r := @r + value) AS `SUM(value) OVER (PARTITION BY grouper ORDER BY id)` FROM ( SELECT @_grouper := 'N' ) vars, ( SELECT * FROM t_aggregator a ORDER BY grouper, id ) ao WHERE CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL AND (@_grouper := grouper) IS NOT NULL

id | grouper | SUM(value) OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|

1 | 1 | 1 |

2 | 1 | 3 |

3 | 1 | 6 |

4 | 1 | 10 |

5 | 1 | 15 |

6 | 2 | 1 |

7 | 2 | 5 |

8 | 2 | 14 |

9 | 2 | 30 |

10 | 2 | 55 |

All that we changed here is an expression in the `SELECT`

clause: each time we add `value`

to `@r`

instead of constant **1**.

`AVG`

is a bit more complicated:

SELECT id, grouper, (@r := @r + `value`) / (@s := @s + 1) AS `AVG(value) OVER (PARTITION BY grouper ORDER BY id)` FROM ( SELECT @_grouper := 'N' ) vars, ( SELECT * FROM t_aggregator a ORDER BY grouper, id ) ao WHERE CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL AND CASE WHEN @_grouper <> grouper THEN @s := 0 ELSE 0 END IS NOT NULL AND (@_grouper := grouper) IS NOT NULL

id | grouper | AVG(value) OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|

1 | 1 | 1 |

2 | 1 | 1.5 |

3 | 1 | 2 |

4 | 1 | 2.5 |

5 | 1 | 3 |

6 | 2 | 1 |

7 | 2 | 2.5 |

8 | 2 | 4.6666666666667 |

9 | 2 | 7.5 |

10 | 2 | 11 |

`AVG`

, in fact, is `SUM() / ROW_NUMBER()`

. That's why we keep two variables here: `@r`

to hold `SUM()`

and `@s`

to hold `ROW_NUMBER()`

, and initialize them in two expressions in the `WHERE`

clause.

In all the cases above, we emulated analytic functions with one column in `PARTITION BY`

clause and one column in `ORDER BY`

clause. If you need additional columns, you do the following:

SELECT to.*, (@r := @r + `value`) / (@s := @s + 1) AS `AVG(value) OVER (PARTITION BY column_p_1, …, column_p_n ORDER BY column_o_1, …, column_o_m)` FROM ( SELECT @_column_p_1 := 'N', …, @_column_p_n := 'N' ) vars, ( SELECT * FROM table t ORDER BY column_p_1, …, column_p_n, column_o_1, …, column_o_m ) to WHERE CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @r := 0 ELSE 0 END IS NOT NULL AND CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @s := 0 ELSE 0 END IS NOT NULL AND (@_column_p_1 := column_p_1) IS NOT NULL … AND (@_column_p_n := column_p_n) IS NOT NULL

In the next article, I'll cover some of more complicated analytic functions.

**To be continued.**