EXPLAIN EXTENDED

How to create fast database queries

SQL Server: running totals

with one comment

From Stack Overflow:

We have a table of transactions which is structured like the following :

transactions
TranxID ItemID TranxDate TranxAmt

TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by.

Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the TranxDate when this occurred.

Note that TranxDate is not unique.

SQL Server is a very nice system, but using it for calculating running totals is a pain.

Oracle supports additional clauses for analytic functions, RANGE and ROWS, which define the boundaries of the function's windows and hence can be used to implement running totals. By default, it is just enough to omit the RANGE clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.

SQL Server's support for window functions only extends aggregate capabilities a little so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM and COUNT it is impossible to control the window boundaries and the records order. Such analytic functions can not be used to calculate running totals.

The common way to write such a running total query is using a subquery or a self join which would count the SUM of all previous records. However, the complexity of this query is O(n^2) and it's not usable for any real volumes of data.

This is one of the few cases when the cursors are faster than a set-based solution described above. But we all are aware of the drawbacks of cursors and better search for something else.

This task, fortunately, is a little more simple than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.

Let's create a sample table:

Table creation details

There are 100,000 records with 5 ItemID's (20,000 records per ItemID). The dates and the values are random, with 5,000 possible dates.

Now, the idea is to develop a recursive query that would iterate though dates and on each date, add the sum of that date's transactions.

Here's how we do it (in steps):

Grouping the records

The first thing we need to go is group the records by ItemID and TransDate:

WITH    q AS
(
SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
FROM    [20100122_running].t_transaction
GROUP BY
ItemID, TranxDate
)

This kills two birds: first, this is a part of a query we will need to obtain the seed values to begin recursion with, second, we get a set of records that are unique on (ItemID, TranxDate) (this will be required a little bit later)

Retrieving the seed values

A recursive CTE should have an anchor part which is a base for recursion. To get the initial values of dates and sums we need to select the records holding the minimal values of TranxDate for each ItemID. The link above describes two approaches to that in more detail, and here I'll just proved the query:

WITH    q AS
(
SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
FROM    [20100122_running].t_transaction
GROUP BY
ItemID, TranxDate
),
SELECT  qo.ItemID, LastDate, TranxDate, TranxSum
FROM    (
SELECT  ItemID, MAX(TranxDate) AS LastDate
FROM    q
GROUP BY
ItemID
) qo
CROSS APPLY
(
SELECT  TOP 1 TranxDate, TranxSum
FROM    q qi
WHERE   qi.ItemID = qo.ItemId
ORDER BY
TranxDate
) qn
)

Note that I use GROUP BY instead of DISTINCT in the subquery that selects the ItemID's. This allows to select the value of MAX(TranxDate) along with each ItemID which could be used as a stop condition in the recursive CTE.

Calculating the running SUM

To calculate the running SUM we need to run the query recursively, adding one day to TranxDate until it reaches LastDate (which we selected with the previous query).

The value of the running SUM will be that of the previous day plus the total amount of transactions for the current day. Unfortunately, due to some limitations of SQL Server's implementation of recursive CTE's, we can use neither LEFT JOIN nor aggregate functions (like SUM) in a recursive part of the CTE.

That's why we should use precalculated SUM's for the current (ItemID, TranxDate) (which is one of the reasons for pregrouping the records that I mentioned earlier in the article), and retrieve its value using a scalar subquery instead of a LEFT JOIN:

WITH    q AS
(
SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
FROM    [20100122_running].t_transaction
GROUP BY
ItemID, TranxDate
),
m AS
(
SELECT  qo.ItemID, LastDate, TranxDate, TranxSum
FROM    (
SELECT  ItemID, MAX(TranxDate) AS LastDate
FROM    q
GROUP BY
ItemID
) qo
CROSS APPLY
(
SELECT  TOP 1 TranxDate, TranxSum
FROM    q qi
WHERE   qi.ItemID = qo.ItemId
ORDER BY
TranxDate
) qn
UNION ALL
SELECT  ItemId, LastDate, DATEADD(day, 1, m.TranxDate),
m.TranxSum +
COALESCE(
(
SELECT  TranxSum
FROM    q
WHERE   q.ItemID = m.ItemId
AND q.TranxDate = DATEADD(day, 1, m.TranxDate)
), 0)
FROM    m
WHERE   m.TranxDate < m.LastDate
        )

Since there are neither a LEFT JOIN nor a SUM in the recursive part of the query, SQL Server allows this construct.

If we were to SELECT from this CTE we would get a list of running sums for each ItemID.

Selecting greatest running sums

All we have to do now is to select records holding a greatest running sum per ItemID.

This article (which I already mentioned above) describes two methods for doing this in SQL Server.

We already used one of these methods, the one which uses DISTINCT / CROSS APPLY. For that task it was better since we needed a GROUP BY anyway to get the LastDate.

However, for getting the records holding greatest running sums, we will use another method, which just filters on partitioned ROW_NUMBER. This method is better, since the query in question is quite a complex CTE. SQL Server does not usually materialize the CTE's and reevaluating it each time in a CROSS APPLY would be quite expensive.

The ROW_NUMBER, on the contrary, can be retrieved and filtered on with a single pass over the CTE's resultset.

Here's how we do it:

WITH    q AS
(
SELECT  ItemID, TranxDate, SUM(TranxAmt) AS TranxSum
FROM    [20100122_running].t_transaction
GROUP BY
ItemID, TranxDate
),
m AS
(
SELECT  qo.ItemID, LastDate, TranxDate, TranxSum
FROM    (
SELECT  ItemID, MAX(TranxDate) AS LastDate
FROM    q
GROUP BY
ItemID
) qo
CROSS APPLY
(
SELECT  TOP 1 TranxDate, TranxSum
FROM    q qi
WHERE   qi.ItemID = qo.ItemId
ORDER BY
TranxDate
) qn
UNION ALL
SELECT  ItemId, LastDate, DATEADD(day, 1, m.TranxDate),
m.TranxSum +
COALESCE(
(
SELECT  TranxSum
FROM    q
WHERE   q.ItemID = m.ItemId
AND q.TranxDate = DATEADD(day, 1, m.TranxDate)
), 0)
FROM    m
WHERE   m.TranxDate < m.LastDate
        )
SELECT  ItemID, TranxDate, TranxSum
FROM    (
        SELECT  m.*, ROW_NUMBER() OVER (PARTITION BY ItemId ORDER BY TranxSum DESC, TranxDate) AS rn
        FROM    m
        ) qp
WHERE   rn = 1
OPTION (MAXRECURSION 0)
ItemID TranxDate TranxSum
1 1999-10-26 00:00:00.000 406.0530
2 2007-07-08 00:00:00.000 5594.7071
3 2009-06-04 00:00:00.000 4903.4597
4 1996-10-14 00:00:00.000 22.3472
5 2008-06-18 00:00:00.000 247.2859
5 rows fetched in 0.0004s (1.0937s)
Table 'Worktable'. Scan count 2, logical reads 149994, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_transaction'. Scan count 49534, logical reads 151549, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1016 ms,  elapsed time = 1101 ms. 
  |--Filter(WHERE:([Expr1029]=(1)))
       |--Sequence Project(DEFINE:([Expr1029]=row_number))
            |--Compute Scalar(DEFINE:([Expr1041]=(1)))
                 |--Segment
                      |--Sort(ORDER BY:([Recr1025] ASC, [Recr1028] DESC, [Recr1027] ASC))
                           |--Index Spool(WITH STACK)
                                |--Concatenation
                                     |--Compute Scalar(DEFINE:([Expr1036]=(0)))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20100122_running].[t_transaction].[ItemID]))
                                     |         |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[ItemID]) DEFINE:([Expr1004]=MAX([test].[20100122_running].[t_transaction].[TranxDate])))
                                     |         |    |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[ItemID], [test].[20100122_running].[t_transaction].[TranxDate]))
                                     |         |         |--Index Scan(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), ORDERED FORWARD)
                                     |         |--Top(TOP EXPRESSION:((1)))
                                     |              |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[TranxDate]) DEFINE:([Expr1008]=SUM([test].[20100122_running].[t_transaction].[TranxAmt])))
                                     |                   |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[test].[20100122_running].[t_transaction].[ItemID]) ORDERED FORWARD)
                                     |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1038], [Recr1009], [Recr1010], [Recr1011], [Recr1012]))
                                          |--Compute Scalar(DEFINE:([Expr1038]=[Expr1037]+(1)))
                                          |    |--Table Spool(WITH STACK)
                                          |--Compute Scalar(DEFINE:([Expr1013]=dateadd(day,(1),[Recr1011]), [Expr1024]=[Recr1012]+CASE WHEN [Expr1017] IS NOT NULL THEN [Expr1022] ELSE ($0.0000) END))
                                               |--Nested Loops(Left Outer Join, PASSTHRU:(IsFalseOrNull [Expr1017] IS NOT NULL))
                                                    |--Nested Loops(Left Outer Join)
                                                    |    |--Filter(WHERE:(STARTUP EXPR([Recr1011]<[Recr1010])))
                                                    |    |    |--Constant Scan
                                                    |    |--Stream Aggregate(DEFINE:([Expr1017]=SUM([test].[20100122_running].[t_transaction].[TranxAmt])))
                                                    |         |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[Recr1009] AND [test].[20100122_running].[t_transaction].[TranxDate]=dateadd(day,(1),[Recr1011])) ORDERED FORWARD)
                                                    |--Stream Aggregate(DEFINE:([Expr1022]=SUM([test].[20100122_running].[t_transaction].[TranxAmt])))
                                                         |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[Recr1009] AND [test].[20100122_running].[t_transaction].[TranxDate]=dateadd(day,(1),[Recr1011])) ORDERED FORWARD)

The query completes in 1 second which is very fast.

It performs much better than a cursor-based solution and of course very much better than SELECT-level subqueries or self-joins that calculate the same sums of the same rows over and over and over again.

Written by Quassnoi

January 22nd, 2010 at 11:00 pm

Posted in SQL Server

One Response to 'SQL Server: running totals'

Subscribe to comments with RSS

  1. Social comments and analytics for this post…

    This post was mentioned on Twitter by MrPaladin: How to achieve acumulating totals in sqlServer #sql http://is.gd/7qkBq

Leave a Reply