Analytic functions: SUM, AVG, ROW_NUMBER
In one of the previous articles I wrote about emulating some of analytic functions in MySQL.
Now, I'd like to cover this question more extensively.
A quick reminder: an analytic function is a function that behaves like an aggregate function with one exception: aggregate function returns one last row for each aggregated set, while an analytic function returns intermediate results too.
An analytic function can be made out of almost all aggregate functions by adding keyword OVER
to them with two additional clauses: PARTITION BY
and ORDER BY
.
PARTITION BY
is analog of GROUP BY
. ORDER BY
defines order in which the intermediate rows will be evaluated.
The behaviour of analytic functions can probably be best illustrated with an example:
SELECT * FROM t_aggregator
id | grouper | value |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 1 | 4 |
5 | 1 | 5 |
6 | 2 | 1 |
7 | 2 | 4 |
8 | 2 | 9 |
9 | 2 | 16 |
10 | 2 | 25 |
This is aggregate function SUM
. It outputs a sum of all value
's having the same grouper
.
SELECT grouper, SUM(value) FROM t_aggregator GROUP BY grouper
grouper | SUM(value) |
---|---|
1 | 15 |
2 | 55 |
We can convert it into analytic function SUM
by adding keyword OVER
:
SELECT grouper, SUM(value) OVER (PARTITION BY grouper ORDER BY id) FROM t_aggregator
id | grouper | SUM(value) OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|
1 | 1 | 1 |
2 | 1 | 3 |
3 | 1 | 6 |
4 | 1 | 10 |
5 | 1 | 15 |
6 | 2 | 1 |
7 | 2 | 5 |
8 | 2 | 14 |
9 | 2 | 30 |
10 | 2 | 55 |
As you can see, analytic SUM
outputs intermediate results for each row: a sum of all value
's from 1st to 1st row, then from 1st to 2nd, etc. The intermediate result is being reset
as the grouper
changes.
Values of the last row in each grouper
are the sum of all value
's from this grouper
. It's exactly what aggregate SUM
returns.
MySQL lacks these functions, so we need to emulate them somehow if we want to use their functionality.
Today I'll write about emulating three analytic functions: SUM
, AVG
and ROW_NUMBER()
. ROW_NUMBER()
is analytic analog of aggregate function COUNT(*)
. It has other name to be more descriptive, as it returns count of rows selected so far
, which in fact is just a row number.
These function are simplest to emulate as their intermediate values are defined by (and only by) previous fields. In our example there are 5 rows for eacher grouper, but if you need a sum, average or count of the first two, you don't need to know other three.
To emulate the functions, we will need to use session variables to keep the intermediate data.
ROW_NUMBER()
is probably the simplest of these functions to emulate, as it doesn't even need the column values:
SELECT id, grouper, (@r := @r + 1) AS `ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)` FROM ( SELECT @_grouper := 'N' ) vars, ( SELECT * FROM t_aggregator a ORDER BY grouper, id ) ao WHERE CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL AND (@_grouper := grouper) IS NOT NULL
id | grouper | ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 1 | 4 |
5 | 1 | 5 |
6 | 2 | 1 |
7 | 2 | 2 |
8 | 2 | 3 |
9 | 2 | 4 |
10 | 2 | 5 |
We use two session variables here: @_grouper
and @r
. @r
is used to store intermediate data, while @_grouper
is used to reset @r
when we enter another group.
@_grouper
is initialized to a value that cannot equal to any of real grouper
's. I chose literal 'N' in this case, as grouper
is declared as unsigned int
.
@r is initialized to 0
in the WHERE
clause each time a grouper changes, this is when the current value of grouper
is not equal to previous (or initial) value of @_grouper
. The @_grouper
then is being set to the current value of grouper, so that it may be used for the same comparison on the next row.
In the SELECT
clause of our query, the value of @r
is being incremented and returned into the resultset. SELECT
clause is evaluated after the WHERE
clause.
To emulate SUM
, we will need change the query above just a little:
SELECT id, grouper, (@r := @r + value) AS `SUM(value) OVER (PARTITION BY grouper ORDER BY id)` FROM ( SELECT @_grouper := 'N' ) vars, ( SELECT * FROM t_aggregator a ORDER BY grouper, id ) ao WHERE CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL AND (@_grouper := grouper) IS NOT NULL
id | grouper | SUM(value) OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|
1 | 1 | 1 |
2 | 1 | 3 |
3 | 1 | 6 |
4 | 1 | 10 |
5 | 1 | 15 |
6 | 2 | 1 |
7 | 2 | 5 |
8 | 2 | 14 |
9 | 2 | 30 |
10 | 2 | 55 |
All that we changed here is an expression in the SELECT
clause: each time we add value
to @r
instead of constant 1.
AVG
is a bit more complicated:
SELECT id, grouper, (@r := @r + `value`) / (@s := @s + 1) AS `AVG(value) OVER (PARTITION BY grouper ORDER BY id)` FROM ( SELECT @_grouper := 'N' ) vars, ( SELECT * FROM t_aggregator a ORDER BY grouper, id ) ao WHERE CASE WHEN @_grouper <> grouper THEN @r := 0 ELSE 0 END IS NOT NULL AND CASE WHEN @_grouper <> grouper THEN @s := 0 ELSE 0 END IS NOT NULL AND (@_grouper := grouper) IS NOT NULL
id | grouper | AVG(value) OVER (PARTITION BY grouper ORDER BY id) |
---|---|---|
1 | 1 | 1 |
2 | 1 | 1.5 |
3 | 1 | 2 |
4 | 1 | 2.5 |
5 | 1 | 3 |
6 | 2 | 1 |
7 | 2 | 2.5 |
8 | 2 | 4.6666666666667 |
9 | 2 | 7.5 |
10 | 2 | 11 |
AVG
, in fact, is SUM() / ROW_NUMBER()
. That's why we keep two variables here: @r
to hold SUM()
and @s
to hold ROW_NUMBER()
, and initialize them in two expressions in the WHERE
clause.
In all the cases above, we emulated analytic functions with one column in PARTITION BY
clause and one column in ORDER BY
clause. If you need additional columns, you do the following:
SELECT to.*, (@r := @r + `value`) / (@s := @s + 1) AS `AVG(value) OVER (PARTITION BY column_p_1, …, column_p_n ORDER BY column_o_1, …, column_o_m)` FROM ( SELECT @_column_p_1 := 'N', …, @_column_p_n := 'N' ) vars, ( SELECT * FROM table t ORDER BY column_p_1, …, column_p_n, column_o_1, …, column_o_m ) to WHERE CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @r := 0 ELSE 0 END IS NOT NULL AND CASE WHEN @_column_p_1 <> column_p_1 OR … OR @_column_p_n <> column_p_n THEN @s := 0 ELSE 0 END IS NOT NULL AND (@_column_p_1 := column_p_1) IS NOT NULL … AND (@_column_p_n := column_p_n) IS NOT NULL
In the next article, I'll cover some of more complicated analytic functions.
To be continued.
Hello
Firstly, thanks for posting these tips.
I have been trying for a while now to get the ROW_NUMBER() function as you have described it here to work, but to no avail. I created a table just like yours and have the code just like yours but the “ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)” column simply increments, that is, it doesn’t reset to 0 after the grouper value changes from 1 to 2. I have found an alternative solution to my problem but would like to see the code you have listed here working. Below is what I have
Thank You!
Select customer,orderdate, (@r:=@r+1) as ‘Row Num’
FROM
(select @_grouper := ‘N’) X,
(select * from test_cust t order by customer) Y
WHERE
CASE WHEN @_grouper customer THEN @r := 0 ELSE 0 END IS NOT NULL
AND (@_grouper := customer) IS NOT NULL
Raj
4 Jan 16 at 03:03
Hello. I’m using aurora reds (equivalent of mysql) I’ve applied your code to my case but it’s not working. I’m getting an error around the partition area. Is it possible to get in contact with you through email or by phone to troubleshoot this further. I cannot recreate my case here because it would be too much of a job in itself and I can’t post my real data for obvious reasons. Thanks
FAR
17 Feb 16 at 00:46
@FAR: sure, use the feedback form: https://explainextended.com/drop-me-a-line
Quassnoi
17 Feb 16 at 00:52
Nice article. Wondering if there is a way to have a column alias for the line:
@r := @r + 1) AS `ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id)`
Regards,
Ram
Ram
24 Jul 17 at 20:48
Nice article . It help in my project.
swapnil kambe
19 Sep 18 at 13:40