Things SQL needs: SERIES()
Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.
I believe this could be best illustrated with an example:
id | source | value |
---|---|---|
1 | 1 | 10 |
2 | 1 | 20 |
3 | 2 | 15 |
4 | 2 | 25 |
5 | 1 | 45 |
6 | 3 | 50 |
7 | 3 | 35 |
8 | 1 | 40 |
9 | 1 | 10 |
The records are ordered by `id`, and within this order there are continuous series of records which share the same value of `source`. In the table above, each uninterrupted run of equal `source` values forms one such series.
We want to calculate some aggregates across each of the series: `MIN`, `MAX`, `SUM`, `AVG`, whatever:
source | min | max | sum | avg |
---|---|---|---|---|
1 | 10 | 20 | 30 | 15.00 |
2 | 15 | 25 | 40 | 20.00 |
1 | 45 | 45 | 45 | 45.00 |
3 | 35 | 50 | 85 | 42.50 |
1 | 10 | 40 | 50 | 25.00 |
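For illustration, the same result can be produced outside SQL in one pass over the ordered rows. A short Python sketch using `itertools.groupby`, which groups consecutive equal keys — exactly the "series" semantics described here:

```python
from itertools import groupby

# sample data from the table above, ordered by id: (id, source, value)
rows = [(1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25), (5, 1, 45),
        (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)]

# groupby starts a new group on every change of source,
# i.e. it yields one group per continuous series
report = []
for source, grp in groupby(rows, key=lambda r: r[1]):
    values = [value for _, _, value in grp]
    report.append((source, min(values), max(values), sum(values),
                   round(sum(values) / len(values), 2)))

for line in report:
    print(*line)
```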
This can be used for different things. I have used it for:

- Reading sensors from a moving elevator (thus tracking its position)
- Recording a user's activity on a site
- Tracking the primary node in a server cluster

Almost any seasoned database developer can recall a need for such a query.
As you can see, the values of `source` repeat, so a mere `GROUP BY` won't work here.
In systems that support window functions, there is a workaround:

```sql
SELECT  source, MIN(value), MAX(value), SUM(value),
        AVG(value)::NUMERIC(20, 2)
FROM    (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS rno,
                ROW_NUMBER() OVER (ORDER BY id) AS rne
        FROM    (
                VALUES
                (1, 1, 10), (2, 1, 20), (3, 2, 15),
                (4, 2, 25), (5, 1, 45), (6, 3, 50),
                (7, 3, 35), (8, 1, 40), (9, 1, 10)
                ) series (id, source, value)
        ) q
GROUP BY
        source, rne - rno
ORDER BY
        MIN(id)
```
source | min | max | sum | avg |
---|---|---|---|---|
1 | 10 | 20 | 30 | 15.00 |
2 | 15 | 25 | 40 | 20.00 |
1 | 45 | 45 | 45 | 45.00 |
3 | 35 | 50 | 85 | 42.50 |
1 | 10 | 40 | 50 | 25.00 |

5 rows fetched in 0.0004s (0.0033s)
```
Sort  (cost=1.40..1.42 rows=9 width=28)
  Sort Key: (min(q.id))
  ->  HashAggregate  (cost=1.01..1.25 rows=9 width=28)
        ->  Subquery Scan q  (cost=0.58..0.85 rows=9 width=28)
              ->  WindowAgg  (cost=0.58..0.74 rows=9 width=12)
                    ->  Sort  (cost=0.58..0.60 rows=9 width=12)
                          Sort Key: "*VALUES*".column1
                          ->  WindowAgg  (cost=0.26..0.44 rows=9 width=12)
                                ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                                      Sort Key: "*VALUES*".column2, "*VALUES*".column1
                                      ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)
```
(this is PostgreSQL).
The idea behind this solution is that the overall order of `id` is retained within each of the series but is broken whenever the series break. Thus, the difference between the overall row number and the `source`-wise row number is an invariant within each series and is guaranteed to change as the series break. Hence, we can `GROUP BY` on it (along with the `source`).
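A quick way to convince oneself that the difference really is an invariant is to compute both row numbers by hand. A small Python sketch over the sample data:

```python
from collections import defaultdict

# sample data, ordered by id: (id, source, value)
rows = [(1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25), (5, 1, 45),
        (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)]

rno = defaultdict(int)  # per-source row number (the PARTITION BY source one)
groups = {}             # (source, rne - rno) -> values of one series

for rne, (_, source, value) in enumerate(rows):  # rne: overall row number
    rno[source] += 1
    # the difference is constant within a series and changes between them
    groups.setdefault((source, rne - rno[source]), []).append(value)

sums = [sum(values) for values in groups.values()]
print(sums)  # one SUM per series, in order of first appearance
```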
This workaround is nice and elegant; however, it requires three sorts and one hash aggregate (not to mention the support for window functions in the first place).
SERIES function
My proposal is to implement a certain extension to the GROUP BY
clause, namely:
```sql
GROUP BY SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)
```
The logic is quite simple: it would find continuous blocks of `series_expression` values, as if the records were ordered by `ordering_expression`, and assign each block to the same group. Within this group, all standard aggregate functions could be supported.
The expression:
```sql
SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)
```
could also serve as a normal window function in the systems that support it. In this case, it would return the ordinal number of the series within each partition.
Implementation
Its implementation would be quite simple: the records sharing the same value of `partitioning_expression` would be ordered on `ordering_expression`, and a zero-based (or one-based) counter and a variable holding the previous value of `series_expression` would be set up. Whenever the value of `series_expression` changes, the counter is incremented (and returned as the output of `SERIES`) and the variable gets the new value of `series_expression`.
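A minimal sketch of that counter logic in Python (the one-based variant, run here against the `source` column of the sample data, already ordered by `id`):

```python
def series(values):
    """Assign a one-based series number to each element of an ordered
    sequence; the number is incremented whenever the value changes."""
    counter = 0
    previous = object()  # sentinel, distinct from any real value
    out = []
    for v in values:
        if v != previous:
            counter += 1     # a new series begins
            previous = v
        out.append(counter)
    return out

print(series([1, 1, 2, 2, 1, 3, 3, 1, 1]))  # → [1, 1, 2, 2, 3, 4, 4, 5, 5]
```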
Using session variables, this can be easily emulated in MySQL:

```sql
SELECT  source,
        MIN(value) AS min,
        MAX(value) AS max,
        SUM(value) AS sum,
        CAST(AVG(value) AS DECIMAL(20, 2)) AS avg
FROM    (
        SELECT  @series := @series + COALESCE(@source <> source, 0) AS series,
                @source := source AS newsource,
                q.*
        FROM    (
                SELECT  @series := 1, @source := NULL
                ) vars
        STRAIGHT_JOIN
                (
                SELECT  1 AS id, 1 AS source, 10 AS value
                UNION ALL
                SELECT  2 AS id, 1 AS source, 20 AS value
                UNION ALL
                SELECT  3 AS id, 2 AS source, 15 AS value
                UNION ALL
                SELECT  4 AS id, 2 AS source, 25 AS value
                UNION ALL
                SELECT  5 AS id, 1 AS source, 45 AS value
                UNION ALL
                SELECT  6 AS id, 3 AS source, 50 AS value
                UNION ALL
                SELECT  7 AS id, 3 AS source, 35 AS value
                UNION ALL
                SELECT  8 AS id, 1 AS source, 40 AS value
                UNION ALL
                SELECT  9 AS id, 1 AS source, 10 AS value
                ) q
        ORDER BY id
        ) q
GROUP BY series
```
source | min | max | sum | avg |
---|---|---|---|---|
1 | 10 | 20 | 30 | 15.00 |
2 | 15 | 25 | 40 | 20.00 |
1 | 45 | 45 | 45 | 45.00 |
3 | 35 | 50 | 85 | 42.50 |
1 | 10 | 40 | 50 | 25.00 |

5 rows fetched in 0.0004s (0.0026s)
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | &lt;derived2&gt; | ALL | | | | | 9 | 100.00 | Using temporary; Using filesort |
2 | DERIVED | &lt;derived3&gt; | system | | | | | 1 | 100.00 | Using filesort |
2 | DERIVED | &lt;derived4&gt; | ALL | | | | | 9 | 100.00 | |
4 | DERIVED | | | | | | | | | No tables used |
5 | UNION | | | | | | | | | No tables used |
6 | UNION | | | | | | | | | No tables used |
7 | UNION | | | | | | | | | No tables used |
8 | UNION | | | | | | | | | No tables used |
9 | UNION | | | | | | | | | No tables used |
10 | UNION | | | | | | | | | No tables used |
11 | UNION | | | | | | | | | No tables used |
12 | UNION | | | | | | | | | No tables used |
| UNION RESULT | &lt;union4,5,6,7,8,9,10,11,12&gt; | ALL | | | | | | | |
3 | DERIVED | | | | | | | | | No tables used |
```sql
select `q`.`source` AS `source`,min(`q`.`value`) AS `min`,max(`q`.`value`) AS `max`,sum(`q`.`value`) AS `sum`,cast(avg(`q`.`value`) as decimal(20,2)) AS `avg` from (select (@series:=((@series) + coalesce(((@source) <> `q`.`source`),0))) AS `series`,(@source:=`q`.`source`) AS `newsource`,`q`.`id` AS `id`,`q`.`source` AS `source`,`q`.`value` AS `value` from (select (@series:=1) AS `@series := 1`,(@source:=NULL) AS `@source := NULL`) `vars` straight_join (select 1 AS `id`,1 AS `source`,10 AS `value` union all select 2 AS `id`,1 AS `source`,20 AS `value` union all select 3 AS `id`,2 AS `source`,15 AS `value` union all select 4 AS `id`,2 AS `source`,25 AS `value` union all select 5 AS `id`,1 AS `source`,45 AS `value` union all select 6 AS `id`,3 AS `source`,50 AS `value` union all select 7 AS `id`,3 AS `source`,35 AS `value` union all select 8 AS `id`,1 AS `source`,40 AS `value` union all select 9 AS `id`,1 AS `source`,10 AS `value`) `q` order by `q`.`id`) `q` group by `q`.`series`
```

The query runs with only one filesort (which could be avoided if the values came ordered from an index).
Grouping on expressions sharing the order
This clause would also allow running certain grouping queries more efficiently.
Say, we need to build a yearly report for sales on a web site, from a sample table `sales` with a date column `dt`, an `amount` column, and an index `ix_sales_dt_amount` on `(dt, amount)`.

Normally, we do it with this query:
```sql
SELECT  YEAR(dt), SUM(amount)
FROM    sales
WHERE   dt >= '2005-01-01'
GROUP BY
        YEAR(dt)
ORDER BY NULL
```

YEAR(dt) | SUM(amount) |
---|---|
2005 | 7885399.58 |
2006 | 7888231.80 |
2007 | 7881056.65 |
2008 | 7894880.23 |
2009 | 7879728.51 |
2010 | 7884989.40 |
2011 | 1034590.31 |

7 rows fetched in 0.0002s (0.7834s)
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | sales | range | ix_sales_dt_amount | ix_sales_dt_amount | 8 | | 250224 | 100.00 | Using where; Using index; Using temporary |

```sql
select year(`20110218_series`.`sales`.`dt`) AS `YEAR(dt)`,sum(`20110218_series`.`sales`.`amount`) AS `SUM(amount)` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01') group by year(`20110218_series`.`sales`.`dt`) order by NULL
```
The query is pretty fast, but we see `Using temporary` in the plan. Since the order of `dt` always implies that of `YEAR(dt)`, and the records are selected from the index on `dt` (and thus come naturally ordered), the temporary table is redundant: a sort grouping could be used instead.

We have no means to tell MySQL (or any other database system) that the orders match, but with the proposed extension, we could do just that:
```sql
SELECT  YEAR(MIN(dt)), SUM(amount)
FROM    sales
WHERE   dt >= '2005-01-01'
GROUP BY
        SERIES(YEAR(dt)) OVER (ORDER BY dt)
```

Since `YEAR(dt)` always forms continuous series when ordered by `dt`, this is the same as the query above, but the temporary table could be avoided.
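The saving this aims at can be illustrated in Python: when the input already arrives in an order the grouping key follows, a single streaming pass suffices, with no temporary table and no sort. A toy sketch (the sales figures below are made up for illustration):

```python
from itertools import groupby
from datetime import date

# hypothetical stand-in for the sales table, already ordered by dt
sales = [(date(2005, 3, 1), 100.0), (date(2005, 9, 2), 50.0),
         (date(2006, 1, 5), 70.0), (date(2006, 6, 9), 30.0),
         (date(2007, 2, 2), 20.0)]

# one streaming pass: YEAR(dt) follows the order of dt, so each year's
# rows are consecutive and can be aggregated as they stream by
report = [(year, sum(amount for _, amount in grp))
          for year, grp in groupby(sales, key=lambda row: row[0].year)]
print(report)  # → [(2005, 150.0), (2006, 100.0), (2007, 20.0)]
```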
Let's emulate this using session variables:

```sql
SELECT  CAST(year AS UNSIGNED),
        CAST(sum_amount AS DECIMAL(20, 2))
FROM    (
        SELECT  @sum := 0, @year := YEAR(MIN(dt))
        FROM    sales
        WHERE   dt >= '2005-01-01'
        ) vars
STRAIGHT_JOIN
        (
        SELECT  @sum AS sum_amount,
                @sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END,
                @year = YEAR(dt) AS series,
                @year AS year,
                @year := YEAR(dt)
        FROM    sales
        WHERE   dt >= '2005-01-01'
        ORDER BY dt
        ) q
WHERE   series = 0
UNION ALL
SELECT  CAST(@year AS UNSIGNED), CAST(@sum AS DECIMAL(20, 2))
```
CAST(year AS UNSIGNED) | CAST(sum_amount AS DECIMAL(20, 2)) |
---|---|
2005 | 7885399.58 |
2006 | 7888231.80 |
2007 | 7881056.65 |
2008 | 7894880.23 |
2009 | 7879728.51 |
2010 | 7884989.40 |
2011 | 1034590.31 |

7 rows fetched in 0.0002s (0.5375s)
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | &lt;derived2&gt; | system | | | | | 1 | 100.00 | |
1 | PRIMARY | &lt;derived3&gt; | ALL | | | | | 322416 | 100.00 | Using where |
3 | DERIVED | sales | range | ix_sales_dt_amount | ix_sales_dt_amount | 8 | | 250224 | 100.00 | Using where; Using index |
2 | DERIVED | | | | | | | | | Select tables optimized away |
4 | UNION | | | | | | | | | No tables used |
| UNION RESULT | &lt;union1,4&gt; | ALL | | | | | | | |
```sql
select cast(`q`.`year` as unsigned) AS `CAST(year AS UNSIGNED)`,cast(`q`.`sum_amount` as decimal(20,2)) AS `CAST(sum_amount AS DECIMAL(20, 2))` from (select (@sum:=0) AS `@sum := 0`,(@year:=year(min(`20110218_series`.`sales`.`dt`))) AS `@year := YEAR(MIN(dt))` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01')) `vars` straight_join (select (@sum) AS `sum_amount`,(@sum:=(case when ((@year) = year(`20110218_series`.`sales`.`dt`)) then ((@sum) + `20110218_series`.`sales`.`amount`) else `20110218_series`.`sales`.`amount` end)) AS `@sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END`,((@year) = year(`20110218_series`.`sales`.`dt`)) AS `series`,(@year) AS `year`,(@year:=year(`20110218_series`.`sales`.`dt`)) AS `@year := YEAR(dt)` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01') order by `20110218_series`.`sales`.`dt`) `q` where (`q`.`series` = 0) union all select cast((@year) as unsigned) AS `CAST(@year AS UNSIGNED)`,cast((@sum) as decimal(20,2)) AS `CAST(@sum AS DECIMAL(20, 2))`
```
We see that, though inline view support is pretty inefficient in MySQL and the resulting query is not as fast as it could be, there is no `Using temporary` in the plan and the query is more efficient. With native support for `SERIES()`, it would be faster still.
Conclusion
A window function (or an extension to the `GROUP BY` clause) that allowed grouping continuous series of an expression within an ordered dataset would ease writing queries that aggregate such series, make those queries more efficient, and even improve certain kinds of grouping queries where the grouping expression shares its order with one of the indexes.
Update of Feb 21 2011
Another nice solution, proposed by @delostilos:

```sql
SELECT  MIN(source), series, SUM(value)
FROM    (
        SELECT  *,
                SUM(COALESCE((source <> ns)::INTEGER, 0)) OVER (ORDER BY id) AS series
        FROM    (
                SELECT  series.*,
                        LAG(source) OVER (ORDER BY id) AS ns
                FROM    (
                        VALUES
                        (1, 1, 10), (2, 1, 20), (3, 2, 15),
                        (4, 2, 25), (5, 1, 45), (6, 3, 50),
                        (7, 3, 35), (8, 1, 40), (9, 1, 10)
                        ) series (id, source, value)
                ) q
        ) q
GROUP BY series
```
min | series | sum |
---|---|---|
2 | 1 | 40 |
1 | 2 | 45 |
3 | 3 | 85 |
1 | 4 | 50 |
1 | 0 | 30 |

5 rows fetched in 0.0005s (0.0030s)
```
HashAggregate  (cost=0.84..0.98 rows=9 width=16)
  ->  WindowAgg  (cost=0.26..0.68 rows=9 width=16)
        ->  WindowAgg  (cost=0.26..0.41 rows=9 width=12)
              ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                    Sort Key: "*VALUES*".column1
                    ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)
```
This is more efficient than the double `ROW_NUMBER`, since the engine only needs to sort on the `id` expression at most once (or does not need to sort at all if the input comes sorted from the previous recordset or from the index).

However, the final `GROUP BY` still cannot use sort aggregation without additional sorting, since the output of the analytic `SUM()` cannot be assumed to have the same order as `id`. That's why native support from the system would still be an improvement.
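The LAG-then-running-SUM numbering can also be sketched in plain Python (run here on the `source` column of the sample data, ordered by `id`):

```python
# source column of the sample data, ordered by id
source = [1, 1, 2, 2, 1, 3, 3, 1, 1]

# step 1: LAG — flag each row whose source differs from the previous row's
# (the first row has no previous value and gets 0, like the COALESCE above)
lag = [None] + source[:-1]
newgroup = [int(prev is not None and cur != prev)
            for cur, prev in zip(source, lag)]

# step 2: a running SUM over the flags yields the series number
series, running = [], 0
for flag in newgroup:
    running += flag
    series.append(running)

print(series)  # → [0, 0, 1, 1, 2, 3, 3, 4, 4]
```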
Regarding “Grouping on expressions sharing the order” in MySQL: if you use `ORDER BY NULL`, you can prevent the `GROUP BY` from automatically trying to sort, thus saving the filesort. `EXPLAIN` easily confirms this.
Fenway
19 Feb 11 at 05:46
@Fenway: nice point, thanks. Actually, it’s `using temporary` that takes the time.

Quassnoi
19 Feb 11 at 17:08
@Quassnoi: well, yes, GROUPing by an expression will do that; but you mentioned filesort specifically, hence the comment. By the way, this is one of the best database blogs out there; I’m an avid reader.
Fenway
19 Feb 11 at 21:45
@Fenway: sure, fixed this now. Thanks!
Quassnoi
19 Feb 11 at 22:58
Hi,
I did try to post earlier, but I think it didn’t come through. With two window functions you can do the same as in the MySQL query. Use the `LAG()` window function to compare the current `source` with the previous one and set a `newgroup` column to 1 if they differ. Then use `newgroup` in a moving/running sum; the result of the running sum is the group number. Both window functions are ordered on `id`, so the engine can skip one sort action.
Regards,
JJ.
delostilos
21 Feb 11 at 03:32
@delostilos: that’s a nice solution too, thanks! However, support from the system would still be a bonus, since the system does not know that the analytic `SUM` has the same order as `id` and hence cannot use `SORT GROUP BY` without additional sorting.

Quassnoi
21 Feb 11 at 14:17
Hi,
Would it be possible in PostgreSQL to create a user-defined aggregate `series()` that combines the lag and the sum?
Regards,
JJ.
delostilos
28 Feb 11 at 01:13
@delostilos: yes, it is possible to create such a function (it would be a window function, not an aggregate, and hence would have to be written as a linked library). However, I doubt it is possible to convince PostgreSQL that it preserves the order.
Quassnoi
28 Feb 11 at 13:48
Just read “MySQL ORDER BY / LIMIT performance: late row lookups”; something of a revelation. Good stuff!
Stevie
10 Mar 11 at 13:16