Things SQL needs: SERIES() at EXPLAIN EXTENDED

Things SQL needs: SERIES()

Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:

id	source	value
1	1	10
2	1	20
3	2	15
4	2	25
5	1	45
6	3	50
7	3	35
8	1	40
9	1	10

The records are ordered by id, and within this order there are continuous series of records which share the same value of source. In the table above, the series are separated by thick lines.

We want to calculate some aggregates across each of the series: MIN, MAX, SUM, AVG, whatever:

source	min	max	sum	avg
1	10	20	30	15.00
2	15	25	40	20.00
1	45	45	45	45.00
3	35	50	85	42.50
1	10	40	50	25.00

This can be used for different things. I used that for:

Reading sensors from a moving elevator (thus tracking its position)
Recording user's activity on a site
Tracking the primary node in a server cluster

, but almost any seasoned database developer can recall a need for such a query.

As you can see, the values of source are repeating so a mere GROUP BY won't work here.

In the systems supporting window functions there is a workaround for that:

SELECT  source, MIN(value), MAX(value), SUM(value), AVG(value)::NUMERIC(20, 2)
FROM    (
SELECT  *,
ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS rno,
ROW_NUMBER() OVER (ORDER BY id) AS rne
FROM    (
VALUES
(1, 1, 10),
(2, 1, 20),
(3, 2, 15),
(4, 2, 25),
(5, 1, 45),
(6, 3, 50),
(7, 3, 35),
(8, 1, 40),
(9, 1, 10)
) series (id, source, value)
) q
GROUP BY
source, rne - rno
ORDER BY
MIN(id)

source	min	max	sum	avg
1	10	20	30	15.00
2	15	25	40	20.00
1	45	45	45	45.00
3	35	50	85	42.50
1	10	40	50	25.00
5 rows fetched in 0.0004s (0.0033s)

Sort  (cost=1.40..1.42 rows=9 width=28)
  Sort Key: (min(q.id))
  ->  HashAggregate  (cost=1.01..1.25 rows=9 width=28)
        ->  Subquery Scan q  (cost=0.58..0.85 rows=9 width=28)
              ->  WindowAgg  (cost=0.58..0.74 rows=9 width=12)
                    ->  Sort  (cost=0.58..0.60 rows=9 width=12)
                          Sort Key: "*VALUES*".column1
                          ->  WindowAgg  (cost=0.26..0.44 rows=9 width=12)
                                ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                                      Sort Key: "*VALUES*".column2, "*VALUES*".column1
                                      ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)

(this is PostgreSQL).

The idea behind this solution is that the overall order of id is retained within each of the series but is broken whenever the series break. Thus, a difference between the overall row number and the source-wise row number is an invariant within each series and is guaranteed to change as the series break. Hence, we can GROUP BY on it (along with the source).

This workaround is nice and elegant, however it requires three sorts and one hash aggregate (apart from the support for window functions of course).

SERIES function

My proposal is to implement a certain extension to the GROUP BY clause, namely:

GROUP BY SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)

The logic is quite simple: it would find continuous blocks of series_expression as if the records were ordered by series_ordering_expression and assign them to the same group. Within this group, all standard aggregation functions could be supported.

The expression:

SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)

could also serve as a normal window function in the systems that support it. In this case, it would return the ordinal number of the series within each partition.

Implementation

Its implementation would be quite simple: the records sharing the same value of partitioning_expression should be ordered on ordering_expression, a zero-based (or one-based) counter a variable holding the previous value of the series_expression should be set up. Whenever the value of the series_expression changes, the counter is incremented (and returned as an output of SERIES) and the variable gets the new value of series_expression.

Using the session variables, this can be easily emulated in MySQL:

SELECT  source,
MIN(value) AS min,
MAX(value) AS max,
SUM(value) AS sum,
CAST(AVG(value) AS DECIMAL(20, 2)) AS avg
FROM    (
SELECT  @series := @series + (COALESCE(@source &lt;&gt; source, 0)) AS series,
@source := source AS newsource,
q.*
FROM    (
SELECT  @series := 1, @source := NULL
) vars
STRAIGHT_JOIN
(
SELECT  1 AS id, 1 AS source, 10 AS value
UNION ALL
SELECT  2 AS id, 1 AS source, 20 AS value
UNION ALL
SELECT  3 AS id, 2 AS source, 15 AS value
UNION ALL
SELECT  4 AS id, 2 AS source, 25 AS value
UNION ALL
SELECT  5 AS id, 1 AS source, 45 AS value
UNION ALL
SELECT  6 AS id, 3 AS source, 50 AS value
UNION ALL
SELECT  7 AS id, 3 AS source, 35 AS value
UNION ALL
SELECT  8 AS id, 1 AS source, 40 AS value
UNION ALL
SELECT  9 AS id, 1 AS source, 10 AS value
) q
ORDER BY
id
) q
GROUP BY
series

source	min	max	sum	avg
1	10	20	30	15.00
2	15	25	40	20.00
1	45	45	45	45.00
3	35	50	85	42.50
1	10	40	50	25.00
5 rows fetched in 0.0004s (0.0026s)

id	select_type	table	type	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL	9	100.00	Using temporary; Using filesort
2	DERIVED	<derived3>	system	1	100.00	Using filesort
2	DERIVED	<derived4>	ALL	9	100.00
4	DERIVED					No tables used
5	UNION					No tables used
6	UNION					No tables used
7	UNION					No tables used
8	UNION					No tables used
9	UNION					No tables used
10	UNION					No tables used
11	UNION					No tables used
12	UNION					No tables used
	UNION RESULT	<union4,5,6,7,8,9,10,11,12>	ALL
3	DERIVED					No tables used

select `q`.`source` AS `source`,min(`q`.`value`) AS `min`,max(`q`.`value`) AS `max`,sum(`q`.`value`) AS `sum`,cast(avg(`q`.`value`) as decimal(20,2)) AS `avg` from (select (@series:=((@series) + coalesce(((@source) <> `q`.`source`),0))) AS `series`,(@source:=`q`.`source`) AS `newsource`,`q`.`id` AS `id`,`q`.`source` AS `source`,`q`.`value` AS `value` from (select (@series:=1) AS `@series := 1`,(@source:=NULL) AS `@source := NULL`) `vars` straight_join (select 1 AS `id`,1 AS `source`,10 AS `value` union all select 2 AS `id`,1 AS `source`,20 AS `value` union all select 3 AS `id`,2 AS `source`,15 AS `value` union all select 4 AS `id`,2 AS `source`,25 AS `value` union all select 5 AS `id`,1 AS `source`,45 AS `value` union all select 6 AS `id`,3 AS `source`,50 AS `value` union all select 7 AS `id`,3 AS `source`,35 AS `value` union all select 8 AS `id`,1 AS `source`,40 AS `value` union all select 9 AS `id`,1 AS `source`,10 AS `value`) `q` order by `q`.`id`) `q` group by `q`.`series`

, with only one filesort (which could be avoided if the values were ordered with an index).

Grouping on expressions sharing the order

This clause would also allow running certain grouping queries more efficiently.

Say, we need to build a yearly report for sales on a web site. Let's create a sample table:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE sales (
id INT NOT NULL,
dt DATETIME NOT NULL,
amount DECIMAL(20, 2) NOT NULL
) ENGINE=InnoDB;

CREATE INDEX
ix_sales_dt_amount
ON      sales (dt, amount);

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    sales
SELECT  id,
        CAST('2011-02-18' AS DATE) - INTERVAL id * 10 MINUTE,
        (10000 + CEILING(RAND(20110218) * 10000)) / 100
FROM    filler;

Normally, we do it with this query:

SELECT  YEAR(dt), SUM(amount)
FROM    sales
WHERE   dt &gt;= '2005-01-01'
GROUP BY
YEAR(dt)
ORDER BY
NULL

YEAR(dt)	SUM(amount)
2005	7885399.58
2006	7888231.80
2007	7881056.65
2008	7894880.23
2009	7879728.51
2010	7884989.40
2011	1034590.31
7 rows fetched in 0.0002s (0.7834s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	sales	range	ix_sales_dt_amount	ix_sales_dt_amount	8		250224	100.00	Using where; Using index; Using temporary

select year(`20110218_series`.`sales`.`dt`) AS `YEAR(dt)`,sum(`20110218_series`.`sales`.`amount`) AS `SUM(amount)` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01') group by year(`20110218_series`.`sales`.`dt`) order by NULL

The query is pretty fast, but we see a using temporary in the plan. Since the order of of dt is always that of YEAR(dt), and the records are selected from the index on dt (which naturally come ordered), the temporary table is redundant. A sort grouping can be used instead.

We have no means to tell MySQL (or any other database system) to tell that the orders match, but with the extension proposed, we could do just that:

SELECT  YEAR(MIN(dt)), SUM(amount)
FROM    sales
WHERE   dt &gt;= '2005-01-01'
GROUP BY
SERIES(YEAR(dt)) OVER (ORDER BY dt)

Since YEAR(dt) will always form continuous series being ordered by dt, this is the same as the query above, but filesort could be avoided.

Let's emulate this using session variables:

SELECT  CAST(year AS UNSIGNED), CAST(sum_amount AS DECIMAL(20, 2))
FROM    (
SELECT  @sum := 0,
@year := YEAR(MIN(dt))
FROM    sales
WHERE   dt &gt;= '2005-01-01'
) vars
STRAIGHT_JOIN
(
SELECT  @sum AS sum_amount,
@sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END,
@year = YEAR(dt) AS series,
@year AS year,
@year := YEAR(dt)
FROM    sales
WHERE   dt &gt;= '2005-01-01'
ORDER BY
dt
) q
WHERE   series = 0
UNION ALL
SELECT  CAST(@year AS UNSIGNED), CAST(@sum AS DECIMAL(20, 2))

CAST(year AS UNSIGNED)	CAST(sum_amount AS DECIMAL(20, 2))
2005	7885399.58
2006	7888231.80
2007	7881056.65
2008	7894880.23
2009	7879728.51
2010	7884989.40
2011	1034590.31
7 rows fetched in 0.0002s (0.5375s)

id	select_type	table	type	possible_keys	key	key_len	rows	filtered	Extra
1	PRIMARY	<derived2>	system				1	100.00
1	PRIMARY	<derived3>	ALL				322416	100.00	Using where
3	DERIVED	sales	range	ix_sales_dt_amount	ix_sales_dt_amount	8	250224	100.00	Using where; Using index
2	DERIVED								Select tables optimized away
4	UNION								No tables used
	UNION RESULT	<union1,4>	ALL

select cast(`q`.`year` as unsigned) AS `CAST(year AS UNSIGNED)`,cast(`q`.`sum_amount` as decimal(20,2)) AS `CAST(sum_amount AS DECIMAL(20, 2))` from (select (@sum:=0) AS `@sum := 0`,(@year:=year(min(`20110218_series`.`sales`.`dt`))) AS `@year := YEAR(MIN(dt))` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01')) `vars` straight_join (select (@sum) AS `sum_amount`,(@sum:=(case when ((@year) = year(`20110218_series`.`sales`.`dt`)) then ((@sum) + `20110218_series`.`sales`.`amount`) else `20110218_series`.`sales`.`amount` end)) AS `@sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END`,((@year) = year(`20110218_series`.`sales`.`dt`)) AS `series`,(@year) AS `year`,(@year:=year(`20110218_series`.`sales`.`dt`)) AS `@year := YEAR(dt)` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01') order by `20110218_series`.`sales`.`dt`) `q` where (`q`.`series` = 0) union all select cast((@year) as unsigned) AS `CAST(@year AS UNSIGNED)`,cast((@sum) as decimal(20,2)) AS `CAST(@sum AS DECIMAL(20, 2))`

We see that though inline view support is pretty inefficient in MySQL and the resulting query is not as fast as it should be, there is no using temporary and the query is more efficient. With a native support for SERIES(), it would be yet more fast.

Conclusion

A window function (or an extension for GROUP BY clause) that would allow grouping continuous series of an expression in an ordered dataset would ease writing queries that need to aggregate such series, make them more efficient and even improve certain kinds of grouping queries where the grouping expression shares its order with one of the indexes.

Update of Feb 21 2011

Another nice solution proposed by @delostilos:

SELECT  MIN(source), series, SUM(value)
FROM    (
SELECT  *,
SUM(COALESCE((source &lt;&gt; ns)::INTEGER, 0)) OVER (ORDER BY id) AS series
FROM    (
SELECT  series.*,
LAG(source) OVER (ORDER BY id) AS ns
FROM    (
VALUES
(1, 1, 10),
(2, 1, 20),
(3, 2, 15),
(4, 2, 25),
(5, 1, 45),
(6, 3, 50),
(7, 3, 35),
(8, 1, 40),
(9, 1, 10)
) series (id, source, value)
) q
) q
GROUP BY
series

min	series	sum
2	1	40
1	2	45
3	3	85
1	4	50
1	0	30
5 rows fetched in 0.0005s (0.0030s)

HashAggregate  (cost=0.84..0.98 rows=9 width=16)
  ->  WindowAgg  (cost=0.26..0.68 rows=9 width=16)
        ->  WindowAgg  (cost=0.26..0.41 rows=9 width=12)
              ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                    Sort Key: "*VALUES*".column1
                    ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)

This is more efficient than double ROW_NUMBER, since the engine only needs to sort on the id expression at most once (or does not need to sort at all if it comes sorted from the previous recordset or from the index).

However, the final GROUP BY still cannot use sort aggregation without additional sorting, since the output of the analytic SUM() cannot be assumed to have the same order as id. That's why the native support from the system would still be an improvement.

Written by Quassnoi

February 18th, 2011 at 11:00 pm

Posted in Miscellaneous

9 Responses to 'Things SQL needs: SERIES()'

Subscribe to comments with RSS

Regarding “Grouping on expressions sharing the order” in MySQL — if you use “ORDER BY NULL”, you can prevent the GROUP BY from automatically trying to sort, thus saving the filesort. EXPLAIN easy confirms this fact.

Fenway

19 Feb 11 at 05:46
@Fenway: nice point, thanks. Actually, it’s using temporary that takes the time.

Quassnoi

19 Feb 11 at 17:08
@Quassnoi: well, yes, GROUPing by an expression with do that; but you mentioned filesort specifically — hence the comment. By the way, this is one of the best database blogs out there — I’m an avid reader.

Fenway

19 Feb 11 at 21:45
@Fenway: sure, fixed this now. Thanks!

Quassnoi

19 Feb 11 at 22:58
Hi,

I did try to post earlier, but I think it didn’t came thru. With two window functions You can do the same as in the MySQL function. Use the LAG() window funtion to compare the current ‘source’ with the previous and set the ‘newgroup’ column to 1 if they differ. Then use the ‘newgroup’ in a moving/running sum, the result of the running sum will be the group number. Both the window functions are ordered on ‘id’, so it can skip one sort action.

Regards,
JJ.

delostilos

21 Feb 11 at 03:32
@delostilos: that’s a nice solution too, thanks! However, a support from the system would be still a bonus since the system does not know that the analytic SUM has the same order as id and hence cannot use SORT GROUP BY without additional sorting.

Quassnoi

21 Feb 11 at 14:17
Hi,

Would it be possible in PostgreSQL to create an user defined aggregate series() that combines the lag and the sum?

Regards,
JJ.

delostilos

28 Feb 11 at 01:13
@delostilos: yes it is possible to create such a function (it would be window, not aggregate and hence should be written as a linked library). However, I doubt it is possible to convince PostgreSQL that is preserves the order.

Quassnoi

28 Feb 11 at 13:48
Just read “MySQL ORDER BY / LIMIT performance: late row lookups”; something of a revelation. Good stuff!

Stevie

10 Mar 11 at 13:16

EXPLAIN EXTENDED