EXPLAIN EXTENDED

How to create fast database queries

Things SQL needs: SERIES()

with 9 comments

Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:

id source value
1 1 10
2 1 20
3 2 15
4 2 25
5 1 45
6 3 50
7 3 35
8 1 40
9 1 10

The records are ordered by id, and within this order there are continuous series of records which share the same value of source. In the table above, the series are separated by thick lines.

We want to calculate some aggregates across each of the series: MIN, MAX, SUM, AVG, whatever:

source min max sum avg
1 10 20 30 15.00
2 15 25 40 20.00
1 45 45 45 45.00
3 35 50 85 42.50
1 10 40 50 25.00

This can be used for different things. I used that for:

  • Reading sensors from a moving elevator (thus tracking its position)
  • Recording user's activity on a site
  • Tracking the primary node in a server cluster

, but almost any seasoned database developer can recall a need for such a query.

As you can see, the values of source are repeating so a mere GROUP BY won't work here.

In the systems supporting window functions there is a workaround for that:

SELECT  source, MIN(value), MAX(value), SUM(value), AVG(value)::NUMERIC(20, 2)
FROM    (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS rno,
                ROW_NUMBER() OVER (ORDER BY id) AS rne
        FROM    (
                VALUES
                (1, 1, 10),
                (2, 1, 20),
                (3, 2, 15),
                (4, 2, 25),
                (5, 1, 45),
                (6, 3, 50),
                (7, 3, 35),
                (8, 1, 40),
                (9, 1, 10)
                ) series (id, source, value)
        ) q
GROUP BY
        source, rne - rno
ORDER BY
        MIN(id)
source min max sum avg
1 10 20 30 15.00
2 15 25 40 20.00
1 45 45 45 45.00
3 35 50 85 42.50
1 10 40 50 25.00
5 rows fetched in 0.0004s (0.0033s)
Sort  (cost=1.40..1.42 rows=9 width=28)
  Sort Key: (min(q.id))
  ->  HashAggregate  (cost=1.01..1.25 rows=9 width=28)
        ->  Subquery Scan q  (cost=0.58..0.85 rows=9 width=28)
              ->  WindowAgg  (cost=0.58..0.74 rows=9 width=12)
                    ->  Sort  (cost=0.58..0.60 rows=9 width=12)
                          Sort Key: "*VALUES*".column1
                          ->  WindowAgg  (cost=0.26..0.44 rows=9 width=12)
                                ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                                      Sort Key: "*VALUES*".column2, "*VALUES*".column1
                                      ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)

(this is PostgreSQL).

The idea behind this solution is that the overall order of id is retained within each of the series but is broken whenever the series break. Thus, a difference between the overall row number and the source-wise row number is an invariant within each series and is guaranteed to change as the series break. Hence, we can GROUP BY on it (along with the source).

This workaround is nice and elegant, however it requires three sorts and one hash aggregate (apart from the support for window functions of course).

SERIES function

My proposal is to implement a certain extension to the GROUP BY clause, namely:

GROUP BY SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)

The logic is quite simple: it would find continuous blocks of series_expression as if the records were ordered by series_ordering_expression and assign them to the same group. Within this group, all standard aggregation functions could be supported.

The expression:

SERIES (series_expression) OVER ([PARTITION BY partitioning_expression] ORDER BY ordering_expression)

could also serve as a normal window function in the systems that support it. In this case, it would return the ordinal number of the series within each partition.

Implementation

Its implementation would be quite simple: the records sharing the same value of partitioning_expression should be ordered on ordering_expression, a zero-based (or one-based) counter a variable holding the previous value of the series_expression should be set up. Whenever the value of the series_expression changes, the counter is incremented (and returned as an output of SERIES) and the variable gets the new value of series_expression.

Using the session variables, this can be easily emulated in MySQL:

SELECT  source,
        MIN(value) AS min,
        MAX(value) AS max,
        SUM(value) AS sum,
        CAST(AVG(value) AS DECIMAL(20, 2)) AS avg
FROM    (
        SELECT  @series := @series + (COALESCE(@source <> source, 0)) AS series,
                @source := source AS newsource,
                q.*
        FROM    (
                SELECT  @series := 1, @source := NULL
                ) vars
        STRAIGHT_JOIN
                (
                SELECT  1 AS id, 1 AS source, 10 AS value
                UNION ALL
                SELECT  2 AS id, 1 AS source, 20 AS value
                UNION ALL
                SELECT  3 AS id, 2 AS source, 15 AS value
                UNION ALL
                SELECT  4 AS id, 2 AS source, 25 AS value
                UNION ALL
                SELECT  5 AS id, 1 AS source, 45 AS value
                UNION ALL
                SELECT  6 AS id, 3 AS source, 50 AS value
                UNION ALL
                SELECT  7 AS id, 3 AS source, 35 AS value
                UNION ALL
                SELECT  8 AS id, 1 AS source, 40 AS value
                UNION ALL
                SELECT  9 AS id, 1 AS source, 10 AS value
                ) q
        ORDER BY
                id
        ) q
GROUP BY
        series
source min max sum avg
1 10 20 30 15.00
2 15 25 40 20.00
1 45 45 45 45.00
3 35 50 85 42.50
1 10 40 50 25.00
5 rows fetched in 0.0004s (0.0026s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 9 100.00 Using temporary; Using filesort
2 DERIVED <derived3> system 1 100.00 Using filesort
2 DERIVED <derived4> ALL 9 100.00
4 DERIVED No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
8 UNION No tables used
9 UNION No tables used
10 UNION No tables used
11 UNION No tables used
12 UNION No tables used
UNION RESULT <union4,5,6,7,8,9,10,11,12> ALL
3 DERIVED No tables used
select `q`.`source` AS `source`,min(`q`.`value`) AS `min`,max(`q`.`value`) AS `max`,sum(`q`.`value`) AS `sum`,cast(avg(`q`.`value`) as decimal(20,2)) AS `avg` from (select (@series:=((@series) + coalesce(((@source) <> `q`.`source`),0))) AS `series`,(@source:=`q`.`source`) AS `newsource`,`q`.`id` AS `id`,`q`.`source` AS `source`,`q`.`value` AS `value` from (select (@series:=1) AS `@series := 1`,(@source:=NULL) AS `@source := NULL`) `vars` straight_join (select 1 AS `id`,1 AS `source`,10 AS `value` union all select 2 AS `id`,1 AS `source`,20 AS `value` union all select 3 AS `id`,2 AS `source`,15 AS `value` union all select 4 AS `id`,2 AS `source`,25 AS `value` union all select 5 AS `id`,1 AS `source`,45 AS `value` union all select 6 AS `id`,3 AS `source`,50 AS `value` union all select 7 AS `id`,3 AS `source`,35 AS `value` union all select 8 AS `id`,1 AS `source`,40 AS `value` union all select 9 AS `id`,1 AS `source`,10 AS `value`) `q` order by `q`.`id`) `q` group by `q`.`series`

, with only one filesort (which could be avoided if the values were ordered with an index).

Grouping on expressions sharing the order

This clause would also allow running certain grouping queries more efficiently.

Say, we need to build a yearly report for sales on a web site. Let's create a sample table:

Table creation details

Normally, we do it with this query:

SELECT  YEAR(dt), SUM(amount)
FROM    sales
WHERE   dt >= '2005-01-01'
GROUP BY
        YEAR(dt)
ORDER BY
        NULL

YEAR(dt) SUM(amount)
2005 7885399.58
2006 7888231.80
2007 7881056.65
2008 7894880.23
2009 7879728.51
2010 7884989.40
2011 1034590.31
7 rows fetched in 0.0002s (0.7834s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE sales range ix_sales_dt_amount ix_sales_dt_amount 8 250224 100.00 Using where; Using index; Using temporary
select year(`20110218_series`.`sales`.`dt`) AS `YEAR(dt)`,sum(`20110218_series`.`sales`.`amount`) AS `SUM(amount)` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01') group by year(`20110218_series`.`sales`.`dt`) order by NULL

The query is pretty fast, but we see a using temporary in the plan. Since the order of of dt is always that of YEAR(dt), and the records are selected from the index on dt (which naturally come ordered), the temporary table is redundant. A sort grouping can be used instead.

We have no means to tell MySQL (or any other database system) to tell that the orders match, but with the extension proposed, we could do just that:

SELECT  YEAR(MIN(dt)), SUM(amount)
FROM    sales
WHERE   dt >= '2005-01-01'
GROUP BY
        SERIES(YEAR(dt)) OVER (ORDER BY dt)

Since YEAR(dt) will always form continuous series being ordered by dt, this is the same as the query above, but filesort could be avoided.

Let's emulate this using session variables:

SELECT  CAST(year AS UNSIGNED), CAST(sum_amount AS DECIMAL(20, 2))
FROM    (
        SELECT  @sum := 0,
                @year := YEAR(MIN(dt))
        FROM    sales
        WHERE   dt >= '2005-01-01'
        ) vars
STRAIGHT_JOIN
        (
        SELECT  @sum AS sum_amount,
                @sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END,
                @year = YEAR(dt) AS series,
                @year AS year,
                @year := YEAR(dt)
        FROM    sales
        WHERE   dt >= '2005-01-01'
        ORDER BY
                dt
        ) q
WHERE   series = 0
UNION ALL
SELECT  CAST(@year AS UNSIGNED), CAST(@sum AS DECIMAL(20, 2))

CAST(year AS UNSIGNED) CAST(sum_amount AS DECIMAL(20, 2))
2005 7885399.58
2006 7888231.80
2007 7881056.65
2008 7894880.23
2009 7879728.51
2010 7884989.40
2011 1034590.31
7 rows fetched in 0.0002s (0.5375s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> system 1 100.00
1 PRIMARY <derived3> ALL 322416 100.00 Using where
3 DERIVED sales range ix_sales_dt_amount ix_sales_dt_amount 8 250224 100.00 Using where; Using index
2 DERIVED Select tables optimized away
4 UNION No tables used
UNION RESULT <union1,4> ALL
select cast(`q`.`year` as unsigned) AS `CAST(year AS UNSIGNED)`,cast(`q`.`sum_amount` as decimal(20,2)) AS `CAST(sum_amount AS DECIMAL(20, 2))` from (select (@sum:=0) AS `@sum := 0`,(@year:=year(min(`20110218_series`.`sales`.`dt`))) AS `@year := YEAR(MIN(dt))` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01')) `vars` straight_join (select (@sum) AS `sum_amount`,(@sum:=(case when ((@year) = year(`20110218_series`.`sales`.`dt`)) then ((@sum) + `20110218_series`.`sales`.`amount`) else `20110218_series`.`sales`.`amount` end)) AS `@sum := CASE WHEN @year = YEAR(dt) THEN @sum + amount ELSE amount END`,((@year) = year(`20110218_series`.`sales`.`dt`)) AS `series`,(@year) AS `year`,(@year:=year(`20110218_series`.`sales`.`dt`)) AS `@year := YEAR(dt)` from `20110218_series`.`sales` where (`20110218_series`.`sales`.`dt` >= '2005-01-01') order by `20110218_series`.`sales`.`dt`) `q` where (`q`.`series` = 0) union all select cast((@year) as unsigned) AS `CAST(@year AS UNSIGNED)`,cast((@sum) as decimal(20,2)) AS `CAST(@sum AS DECIMAL(20, 2))`

We see that though inline view support is pretty inefficient in MySQL and the resulting query is not as fast as it should be, there is no using temporary and the query is more efficient. With a native support for SERIES(), it would be yet more fast.

Conclusion

A window function (or an extension for GROUP BY clause) that would allow grouping continuous series of an expression in an ordered dataset would ease writing queries that need to aggregate such series, make them more efficient and even improve certain kinds of grouping queries where the grouping expression shares its order with one of the indexes.

Update of Feb 21 2011

Another nice solution proposed by @delostilos:

SELECT  MIN(source), series, SUM(value)
FROM    (
        SELECT  *,
                SUM(COALESCE((source <> ns)::INTEGER, 0)) OVER (ORDER BY id) AS series
        FROM    (
                SELECT  series.*,
                        LAG(source) OVER (ORDER BY id) AS ns
                FROM    (
                        VALUES
                        (1, 1, 10),
                        (2, 1, 20),
                        (3, 2, 15),
                        (4, 2, 25),
                        (5, 1, 45),
                        (6, 3, 50),
                        (7, 3, 35),
                        (8, 1, 40),
                        (9, 1, 10)
                        ) series (id, source, value)
                ) q
        ) q
GROUP BY
        series
min series sum
2 1 40
1 2 45
3 3 85
1 4 50
1 0 30
5 rows fetched in 0.0005s (0.0030s)
HashAggregate  (cost=0.84..0.98 rows=9 width=16)
  ->  WindowAgg  (cost=0.26..0.68 rows=9 width=16)
        ->  WindowAgg  (cost=0.26..0.41 rows=9 width=12)
              ->  Sort  (cost=0.26..0.28 rows=9 width=12)
                    Sort Key: "*VALUES*".column1
                    ->  Values Scan on "*VALUES*"  (cost=0.00..0.11 rows=9 width=12)

This is more efficient than double ROW_NUMBER, since the engine only needs to sort on the id expression at most once (or does not need to sort at all if it comes sorted from the previous recordset or from the index).

However, the final GROUP BY still cannot use sort aggregation without additional sorting, since the output of the analytic SUM() cannot be assumed to have the same order as id. That's why the native support from the system would still be an improvement.

Written by Quassnoi

February 18th, 2011 at 11:00 pm

Posted in Miscellaneous

9 Responses to 'Things SQL needs: SERIES()'

Subscribe to comments with RSS

  1. Regarding “Grouping on expressions sharing the order” in MySQL — if you use “ORDER BY NULL”, you can prevent the GROUP BY from automatically trying to sort, thus saving the filesort. EXPLAIN easy confirms this fact.

    Fenway

    19 Feb 11 at 05:46

  2. @Fenway: nice point, thanks. Actually, it’s using temporary that takes the time.

    Quassnoi

    19 Feb 11 at 17:08

  3. @Quassnoi: well, yes, GROUPing by an expression with do that; but you mentioned filesort specifically — hence the comment. By the way, this is one of the best database blogs out there — I’m an avid reader.

    Fenway

    19 Feb 11 at 21:45

  4. @Fenway: sure, fixed this now. Thanks!

    Quassnoi

    19 Feb 11 at 22:58

  5. Hi,

    I did try to post earlier, but I think it didn’t came thru. With two window functions You can do the same as in the MySQL function. Use the LAG() window funtion to compare the current ‘source’ with the previous and set the ‘newgroup’ column to 1 if they differ. Then use the ‘newgroup’ in a moving/running sum, the result of the running sum will be the group number. Both the window functions are ordered on ‘id’, so it can skip one sort action.

    Regards,
    JJ.

    delostilos

    21 Feb 11 at 03:32

  6. @delostilos: that’s a nice solution too, thanks! However, a support from the system would be still a bonus since the system does not know that the analytic SUM has the same order as id and hence cannot use SORT GROUP BY without additional sorting.

    Quassnoi

    21 Feb 11 at 14:17

  7. Hi,

    Would it be possible in PostgreSQL to create an user defined aggregate series() that combines the lag and the sum?

    Regards,
    JJ.

    delostilos

    28 Feb 11 at 01:13

  8. @delostilos: yes it is possible to create such a function (it would be window, not aggregate and hence should be written as a linked library). However, I doubt it is possible to convince PostgreSQL that is preserves the order.

    Quassnoi

    28 Feb 11 at 13:48

  9. Just read “MySQL ORDER BY / LIMIT performance: late row lookups”; something of a revelation. Good stuff!

    Stevie

    10 Mar 11 at 13:16

Leave a Reply