## Things SQL needs: SERIES()

Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:

id | source | value
---|---|---
1 | 1 | 10
2 | 1 | 20
3 | 2 | 15
4 | 2 | 25
5 | 1 | 45
6 | 3 | 50
7 | 3 | 35
8 | 1 | 40
9 | 1 | 10

The records are ordered by `id`, and within this order there are continuous series of records which share the same value of `source`. In the table above, a new series begins whenever `source` changes (after ids 2, 4, 5 and 7).

We want to calculate some aggregates across each of the series: `MIN`, `MAX`, `SUM`, `AVG`, whatever:

source | min | max | sum | avg
---|---|---|---|---
1 | 10 | 20 | 30 | 15.00
2 | 15 | 25 | 40 | 20.00
1 | 45 | 45 | 45 | 45.00
3 | 35 | 50 | 85 | 42.50
1 | 10 | 40 | 50 | 25.00

This can be used for different things. I have used it for:

- Reading sensors from a moving elevator (thus tracking its position)
- Recording a user's activity on a site
- Tracking the primary node in a server cluster

Almost any seasoned database developer can recall a need for such a query.

As you can see, the values of `source` repeat, so a mere `GROUP BY` won't work here.

In systems that support window functions, there is a workaround for that:

```sql
SELECT  source, MIN(value), MAX(value), SUM(value), AVG(value)::NUMERIC(20, 2)
FROM    (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS rno,
                ROW_NUMBER() OVER (ORDER BY id) AS rne
        FROM    (
                VALUES
                (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25), (5, 1, 45),
                (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)
                ) series (id, source, value)
        ) q
GROUP BY
        source, rne - rno
ORDER BY
        MIN(id)
```

source | min | max | sum | avg
---|---|---|---|---
1 | 10 | 20 | 30 | 15.00
2 | 15 | 25 | 40 | 20.00
1 | 45 | 45 | 45 | 45.00
3 | 35 | 50 | 85 | 42.50
1 | 10 | 40 | 50 | 25.00

5 rows fetched in 0.0004s (0.0033s)

```
Sort  (cost=1.40..1.42 rows=9 width=28)
  Sort Key: (min(q.id))
  ->  HashAggregate  (cost=1.01..1.25 rows=9 width=28)
        ->  Subquery Scan q  (cost=0.58..0.85 rows=9 width=28)
              ->  WindowAgg  (cost=0.58..0.74 rows=9 width=12)
                    ->  Sort  (cost=0.58..0.60 rows=9 width=12)
                          Sort Key: "*VALUES*".column1
                          ->  WindowAgg  (cost=0.26..0.44 rows=9 width=12)
                                ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                                      Sort Key: "*VALUES*".column2, "*VALUES*".column1
                                      ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)
```

(this is **PostgreSQL**).

The idea behind this solution is that the overall order of `id` is retained within each of the series but is broken whenever the series break. Thus, the difference between the overall row number and the `source`-wise row number is an invariant within each series and is guaranteed to change as the series break. Hence, we can `GROUP BY` on it (along with `source`).

This workaround is nice and elegant; however, it requires three sorts and one hash aggregate (apart from requiring window function support, of course).
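The same `rne - rno` trick can be verified outside of PostgreSQL. Here is a minimal sketch in Python's built-in `sqlite3` module, assuming an SQLite version of 3.25 or later (the first to support window functions):

```python
import sqlite3

# Load the sample data from the article into an in-memory database.
conn = sqlite3.connect(":memory:")
rows = [(1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25),
        (5, 1, 45), (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)]
conn.execute("CREATE TABLE series (id INTEGER, source INTEGER, value INTEGER)")
conn.executemany("INSERT INTO series VALUES (?, ?, ?)", rows)

# The difference of the two row numbers is constant within a series
# and changes between series, so it can serve as a grouping key.
result = conn.execute("""
    SELECT  source, MIN(value), MAX(value), SUM(value), AVG(value)
    FROM    (
            SELECT  *,
                    ROW_NUMBER() OVER (ORDER BY id) -
                    ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS grp
            FROM    series
            )
    GROUP BY source, grp
    ORDER BY MIN(id)
""").fetchall()

for row in result:
    print(row)
# (1, 10, 20, 30, 15.0)
# (2, 15, 25, 40, 20.0)
# (1, 45, 45, 45, 45.0)
# (3, 35, 50, 85, 42.5)
# (1, 10, 40, 50, 25.0)
```

The output matches the table above (SQLite's `AVG` returns floats rather than `NUMERIC(20, 2)`).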

### SERIES function

My proposal is to implement an extension to the `GROUP BY` clause, namely:

`GROUP BY SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)`

The logic is quite simple: it would find continuous blocks of identical `series_expression` values, as if the records were ordered by `ordering_expression`, and assign each block to its own group. Within this group, all standard aggregate functions could be supported.

The expression:

`SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)`

could also serve as a normal window function in systems that support them. In this case, it would return the ordinal number of the series within each partition.

### Implementation

Its implementation would be quite simple: the records sharing the same value of `partitioning_expression` are ordered on `ordering_expression`, and a zero-based (or one-based) counter and a variable holding the previous value of `series_expression` are set up. Whenever the value of `series_expression` changes, the counter is incremented (and returned as the output of `SERIES`) and the variable gets the new value of `series_expression`.
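The counter logic just described can be sketched in a few lines of Python (the function name and shape are illustrative, not part of any SQL engine):

```python
def series_numbers(rows, key):
    """Assign a zero-based series number to each row: the counter is
    incremented whenever key(row) differs from the previous row's key."""
    counter = -1
    prev = object()          # sentinel that differs from any real key
    out = []
    for row in rows:
        current = key(row)
        if current != prev:  # series break: bump the counter, remember the value
            counter += 1
            prev = current
        out.append(counter)
    return out

# The sample data from the article, already ordered by id:
rows = [(1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25),
        (5, 1, 45), (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)]
print(series_numbers(rows, key=lambda r: r[1]))
# -> [0, 0, 1, 1, 2, 3, 3, 4, 4]
```

This is exactly the value the proposed `SERIES()` window function would return for each row, and grouping on it yields the five series from the table above.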

Using session variables, this can easily be emulated in **MySQL**:

```sql
SELECT  source,
        MIN(value) AS min,
        MAX(value) AS max,
        SUM(value) AS sum,
        CAST(AVG(value) AS DECIMAL(20, 2)) AS avg
FROM    (
        SELECT  @series := @series + (COALESCE(@source <> source, 0)) AS series,
                @source := source AS newsource,
                q.*
        FROM    (
                SELECT  @series := 1, @source := NULL
                ) vars
        STRAIGHT_JOIN
                (
                SELECT  1 AS id, 1 AS source, 10 AS value
                UNION ALL
                SELECT  2 AS id, 1 AS source, 20 AS value
                UNION ALL
                SELECT  3 AS id, 2 AS source, 15 AS value
                UNION ALL
                SELECT  4 AS id, 2 AS source, 25 AS value
                UNION ALL
                SELECT  5 AS id, 1 AS source, 45 AS value
                UNION ALL
                SELECT  6 AS id, 3 AS source, 50 AS value
                UNION ALL
                SELECT  7 AS id, 3 AS source, 35 AS value
                UNION ALL
                SELECT  8 AS id, 1 AS source, 40 AS value
                UNION ALL
                SELECT  9 AS id, 1 AS source, 10 AS value
                ) q
        ORDER BY id
        ) q
GROUP BY series
```

source | min | max | sum | avg
---|---|---|---|---
1 | 10 | 20 | 30 | 15.00
2 | 15 | 25 | 40 | 20.00
1 | 45 | 45 | 45 | 45.00
3 | 35 | 50 | 85 | 42.50
1 | 10 | 40 | 50 | 25.00

5 rows fetched in 0.0004s (0.0026s)

id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
---|---|---|---|---|---|---|---|---|---|---
1 | PRIMARY | &lt;derived2&gt; | ALL | | | | | 9 | 100.00 | Using temporary; Using filesort
2 | DERIVED | &lt;derived3&gt; | system | | | | | 1 | 100.00 | Using filesort
2 | DERIVED | &lt;derived4&gt; | ALL | | | | | 9 | 100.00 |
4 | DERIVED | | | | | | | | | No tables used
5 | UNION | | | | | | | | | No tables used
6 | UNION | | | | | | | | | No tables used
7 | UNION | | | | | | | | | No tables used
8 | UNION | | | | | | | | | No tables used
9 | UNION | | | | | | | | | No tables used
10 | UNION | | | | | | | | | No tables used
11 | UNION | | | | | | | | | No tables used
12 | UNION | | | | | | | | | No tables used
| UNION RESULT | &lt;union4,5,6,7,8,9,10,11,12&gt; | ALL | | | | | | |
3 | DERIVED | | | | | | | | | No tables used

```sql
select  `q`.`source` AS `source`, min(`q`.`value`) AS `min`, max(`q`.`value`) AS `max`,
        sum(`q`.`value`) AS `sum`, cast(avg(`q`.`value`) as decimal(20,2)) AS `avg`
from    (
        select  (@series:=((@series) + coalesce(((@source) <> `q`.`source`),0))) AS `series`,
                (@source:=`q`.`source`) AS `newsource`,
                `q`.`id` AS `id`, `q`.`source` AS `source`, `q`.`value` AS `value`
        from    (select (@series:=1) AS `@series := 1`, (@source:=NULL) AS `@source := NULL`) `vars`
        straight_join
                (select 1 AS `id`, 1 AS `source`, 10 AS `value`
                 union all select 2 AS `id`, 1 AS `source`, 20 AS `value`
                 union all select 3 AS `id`, 2 AS `source`, 15 AS `value`
                 union all select 4 AS `id`, 2 AS `source`, 25 AS `value`
                 union all select 5 AS `id`, 1 AS `source`, 45 AS `value`
                 union all select 6 AS `id`, 3 AS `source`, 50 AS `value`
                 union all select 7 AS `id`, 3 AS `source`, 35 AS `value`
                 union all select 8 AS `id`, 1 AS `source`, 40 AS `value`
                 union all select 9 AS `id`, 1 AS `source`, 10 AS `value`) `q`
        order by `q`.`id`
        ) `q`
group by `q`.`series`
```

This uses only one filesort (which could be avoided if the values came ordered from an index).

### Grouping on expressions sharing the order

This clause would also allow running certain grouping queries more efficiently.

Say, we need to build a yearly report for sales on a web site. Let's create a sample table:

Normally, we do it with this query:

```sql
SELECT  YEAR(dt), SUM(amount)
FROM    sales
WHERE   dt >= '2005-01-01'
GROUP BY
        YEAR(dt)
ORDER BY
        NULL
```

YEAR(dt) | SUM(amount)
---|---
2005 | 7885399.58
2006 | 7888231.80
2007 | 7881056.65
2008 | 7894880.23
2009 | 7879728.51
2010 | 7884989.40
2011 | 1034590.31

7 rows fetched in 0.0002s (0.7834s)

id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
---|---|---|---|---|---|---|---|---|---|---
1 | SIMPLE | sales | range | ix_sales_dt_amount | ix_sales_dt_amount | 8 | | 250224 | 100.00 | Using where; Using index; Using temporary

```sql
select  year(`20110218_series`.`sales`.`dt`) AS `YEAR(dt)`,
        sum(`20110218_series`.`sales`.`amount`) AS `SUM(amount)`
from    `20110218_series`.`sales`
where   (`20110218_series`.`sales`.`dt` >= '2005-01-01')
group by year(`20110218_series`.`sales`.`dt`)
order by NULL
```

The query is pretty fast, but we see `using temporary` in the plan. Since the order of `dt` always matches that of `YEAR(dt)`, and the records are selected from the index on `dt` (and thus come naturally ordered), the temporary table is redundant: a sort grouping could be used instead.

We have no means to tell **MySQL** (or any other database system) that the orders match, but with the extension proposed we could do just that:

```sql
SELECT  YEAR(MIN(dt)), SUM(amount)
FROM    sales
WHERE   dt >= '2005-01-01'
GROUP BY
        SERIES(YEAR(dt)) OVER (ORDER BY dt)
```

Since `YEAR(dt)` always forms continuous series when ordered by `dt`, this is the same as the query above, but the `filesort` could be avoided.
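The one-pass sort grouping that `SERIES()` would enable is essentially what Python's `itertools.groupby` does over an already-sorted stream. A small sketch with made-up sales figures (this illustrates the idea, not MySQL internals):

```python
from datetime import date
from itertools import groupby

# Records already ordered by dt, as they would arrive from the index on dt:
sales = [(date(2005, 3, 1), 100.0), (date(2005, 7, 9), 250.0),
         (date(2006, 1, 5), 80.0), (date(2006, 11, 2), 40.0),
         (date(2007, 6, 30), 300.0)]

# Because the order of dt implies the order of YEAR(dt), each year forms one
# continuous run, and a single sequential pass aggregates it: no temporary
# table, no re-sorting on the grouping expression.
report = [(year, sum(amount for _, amount in group))
          for year, group in groupby(sales, key=lambda r: r[0].year)]
print(report)
# -> [(2005, 350.0), (2006, 120.0), (2007, 300.0)]
```

`groupby` only merges *adjacent* equal keys, which is exactly the series semantics proposed here and is why it needs sorted input to produce one group per year.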

Let's emulate this using session variables:

```sql
SELECT  CAST(year AS UNSIGNED), CAST(sum_amount AS DECIMAL(20, 2))
FROM    (
        SELECT  @sum := 0, @year := YEAR(MIN(dt))
        FROM    sales
        WHERE   dt >= '2005-01-01'
        ) vars
STRAIGHT_JOIN
        (
        SELECT  @sum AS sum_amount,
                @sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END,
                @year = YEAR(dt) AS series,
                @year AS year,
                @year := YEAR(dt)
        FROM    sales
        WHERE   dt >= '2005-01-01'
        ORDER BY dt
        ) q
WHERE   series = 0
UNION ALL
SELECT  CAST(@year AS UNSIGNED), CAST(@sum AS DECIMAL(20, 2))
```

CAST(year AS UNSIGNED) | CAST(sum_amount AS DECIMAL(20, 2))
---|---
2005 | 7885399.58
2006 | 7888231.80
2007 | 7881056.65
2008 | 7894880.23
2009 | 7879728.51
2010 | 7884989.40
2011 | 1034590.31

7 rows fetched in 0.0002s (0.5375s)

id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
---|---|---|---|---|---|---|---|---|---|---
1 | PRIMARY | &lt;derived2&gt; | system | | | | | 1 | 100.00 |
1 | PRIMARY | &lt;derived3&gt; | ALL | | | | | 322416 | 100.00 | Using where
3 | DERIVED | sales | range | ix_sales_dt_amount | ix_sales_dt_amount | 8 | | 250224 | 100.00 | Using where; Using index
2 | DERIVED | | | | | | | | | Select tables optimized away
4 | UNION | | | | | | | | | No tables used
| UNION RESULT | &lt;union1,4&gt; | ALL | | | | | | |

```sql
select  cast(`q`.`year` as unsigned) AS `CAST(year AS UNSIGNED)`,
        cast(`q`.`sum_amount` as decimal(20,2)) AS `CAST(sum_amount AS DECIMAL(20, 2))`
from    (
        select  (@sum:=0) AS `@sum := 0`,
                (@year:=year(min(`20110218_series`.`sales`.`dt`))) AS `@year := YEAR(MIN(dt))`
        from    `20110218_series`.`sales`
        where   (`20110218_series`.`sales`.`dt` >= '2005-01-01')
        ) `vars`
straight_join
        (
        select  (@sum) AS `sum_amount`,
                (@sum:=(case when ((@year) = year(`20110218_series`.`sales`.`dt`)) then ((@sum) + `20110218_series`.`sales`.`amount`) else `20110218_series`.`sales`.`amount` end)) AS `@sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END`,
                ((@year) = year(`20110218_series`.`sales`.`dt`)) AS `series`,
                (@year) AS `year`,
                (@year:=year(`20110218_series`.`sales`.`dt`)) AS `@year := YEAR(dt)`
        from    `20110218_series`.`sales`
        where   (`20110218_series`.`sales`.`dt` >= '2005-01-01')
        order by `20110218_series`.`sales`.`dt`
        ) `q`
where   (`q`.`series` = 0)
union all
select  cast((@year) as unsigned) AS `CAST(@year AS UNSIGNED)`,
        cast((@sum) as decimal(20,2)) AS `CAST(@sum AS DECIMAL(20, 2))`
```

We see that though inline view support is pretty inefficient in **MySQL** and the resulting query is not as fast as it should be, there is no `using temporary` and the query is more efficient. With native support for `SERIES()`, it would be faster still.

### Conclusion

A window function (or an extension to the `GROUP BY` clause) that allowed grouping continuous series of an expression in an ordered dataset would ease writing queries that aggregate such series, make them more efficient, and even improve certain kinds of grouping queries where the grouping expression shares its order with one of the indexes.

### Update of Feb 21 2011

Another nice solution proposed by **@delostilos**:

```sql
SELECT  MIN(source), series, SUM(value)
FROM    (
        SELECT  *,
                SUM(COALESCE((source <> ns)::INTEGER, 0)) OVER (ORDER BY id) AS series
        FROM    (
                SELECT  series.*,
                        LAG(source) OVER (ORDER BY id) AS ns
                FROM    (
                        VALUES
                        (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25), (5, 1, 45),
                        (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)
                        ) series (id, source, value)
                ) q
        ) q
GROUP BY series
```

min | series | sum
---|---|---
2 | 1 | 40
1 | 2 | 45
3 | 3 | 85
1 | 4 | 50
1 | 0 | 30

5 rows fetched in 0.0005s (0.0030s)

```
HashAggregate  (cost=0.84..0.98 rows=9 width=16)
  ->  WindowAgg  (cost=0.26..0.68 rows=9 width=16)
        ->  WindowAgg  (cost=0.26..0.41 rows=9 width=12)
              ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                    Sort Key: "*VALUES*".column1
                    ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)
```

This is more efficient than the double `ROW_NUMBER`, since the engine only needs to sort on the `id` expression at most once (or not at all if the records come sorted from the previous recordset or from an index).

However, the final `GROUP BY` still cannot use sort aggregation without additional sorting, since the output of the analytic `SUM()` cannot be assumed to have the same order as `id`. That's why native support from the system would still be an improvement.
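The `LAG()` plus running `SUM()` variant also runs unchanged in SQLite (3.25 or later), which makes for an easy sketch of verifying it with Python's `sqlite3` module:

```python
import sqlite3

# The sample data from the article.
conn = sqlite3.connect(":memory:")
rows = [(1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25),
        (5, 1, 45), (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)]
conn.execute("CREATE TABLE series (id INTEGER, source INTEGER, value INTEGER)")
conn.executemany("INSERT INTO series VALUES (?, ?, ?)", rows)

# LAG() flags the series breaks; the running SUM() over those flags
# numbers the series; the outer GROUP BY aggregates each series.
result = conn.execute("""
    SELECT  MIN(source), grp, SUM(value)
    FROM    (
            SELECT  *,
                    SUM(CASE WHEN source <> prev THEN 1 ELSE 0 END)
                        OVER (ORDER BY id) AS grp
            FROM    (
                    SELECT  *, LAG(source) OVER (ORDER BY id) AS prev
                    FROM    series
                    )
            )
    GROUP BY grp
    ORDER BY grp
""").fetchall()
for row in result:
    print(row)
# (1, 0, 30)
# (2, 1, 40)
# (1, 2, 45)
# (3, 3, 85)
# (1, 4, 50)
```

Note that `LAG(source)` is `NULL` on the first row; the `CASE` maps that comparison to 0, so the first series gets number 0, mirroring the `COALESCE` in the PostgreSQL query above.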

Regarding “Grouping on expressions sharing the order” in MySQL: if you use `ORDER BY NULL`, you can prevent the `GROUP BY` from automatically trying to sort, thus saving the filesort. EXPLAIN easily confirms this.

Fenway, 19 Feb 11 at 05:46

@Fenway: nice point, thanks. Actually, it’s `using temporary` that takes the time.

Quassnoi, 19 Feb 11 at 17:08

@Quassnoi: well, yes, GROUPing by an expression will do that; but you mentioned filesort specifically, hence the comment. By the way, this is one of the best database blogs out there; I’m an avid reader.

Fenway, 19 Feb 11 at 21:45

@Fenway: sure, fixed this now. Thanks!

Quassnoi, 19 Feb 11 at 22:58

Hi,

I did try to post earlier, but I think it didn’t come through. With two window functions you can do the same as in the MySQL version. Use the LAG() window function to compare the current ‘source’ with the previous one and set a ‘newgroup’ column to 1 if they differ. Then use ‘newgroup’ in a moving/running sum; the result of the running sum will be the group number. Both window functions are ordered on ‘id’, so the engine can skip one sort action.

Regards,

JJ.

delostilos, 21 Feb 11 at 03:32

@delostilos: that’s a nice solution too, thanks! However, support from the system would still be a bonus, since the system does not know that the analytic `SUM` has the same order as `id` and hence cannot use `SORT GROUP BY` without additional sorting.

Quassnoi, 21 Feb 11 at 14:17

Hi,

Would it be possible in PostgreSQL to create a user-defined aggregate series() that combines the lag and the sum?

Regards,

JJ.

delostilos, 28 Feb 11 at 01:13

@delostilos: yes, it is possible to create such a function (it would be a window function, not an aggregate, and hence would have to be written as a linked library). However, I doubt it is possible to convince PostgreSQL that it preserves the order.

Quassnoi, 28 Feb 11 at 13:48

Just read “MySQL ORDER BY / LIMIT performance: late row lookups”; something of a revelation. Good stuff!

Stevie, 10 Mar 11 at 13:16