SQL Server: running totals
From Stack Overflow:
We have a table of transactions which is structured like the following :
transactions TranxID ItemID TranxDate TranxAmt
TranxAmt
can be positive or negative, so the running total of this field (for anyItemID
) will go up and down as time goes by.Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the
TranxDate
when this occurred.Note that TranxDate is not unique.
SQL Server is a very nice system, but using it for calculating running totals is a pain.
Oracle supports additional clauses for analytic functions, RANGE
and ROWS
, which define the boundaries of the function's windows and hence can be used to implement running totals. By default, it is just enough to omit the RANGE
clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.
SQL Server's support for window functions only extends aggregate capabilities a little so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM
and COUNT
it is impossible to control the window boundaries and the records order. Such analytic functions can not be used to calculate running totals.
The common way to write such a running total query is using a subquery or a self join which would count the SUM
of all previous records. However, the complexity of this query is O(n^2)
and it's not usable for any real volumes of data.
This is one of the few cases when the cursors are faster than a set-based solution described above. But we all are aware of the drawbacks of cursors and better search for something else.
This task, fortunately, is a little more simple than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.
Let's create a sample table:
Table creation details
There are 100,000 records with 5 ItemID
's (20,000 records per ItemID
). The dates and the values are random, with 5,000 possible dates.
Now, the idea is to develop a recursive query that would iterate though dates and on each date, add the sum of that date's transactions.
Here's how we do it (in steps):
Grouping the records
The first thing we need to go is group the records by ItemID
and TransDate
:
WITH q AS ( SELECT ItemID, TranxDate, SUM(TranxAmt) AS TranxSum FROM [20100122_running].t_transaction GROUP BY ItemID, TranxDate )
This kills two birds: first, this is a part of a query we will need to obtain the seed values to begin recursion with, second, we get a set of records that are unique on (ItemID, TranxDate)
(this will be required a little bit later)
Retrieving the seed values
A recursive CTE should have an anchor part which is a base for recursion. To get the initial values of dates and sums we need to select the records holding the minimal values of TranxDate
for each ItemID
. The link above describes two approaches to that in more detail, and here I'll just proved the query:
WITH q AS ( SELECT ItemID, TranxDate, SUM(TranxAmt) AS TranxSum FROM [20100122_running].t_transaction GROUP BY ItemID, TranxDate ), SELECT qo.ItemID, LastDate, TranxDate, TranxSum FROM ( SELECT ItemID, MAX(TranxDate) AS LastDate FROM q GROUP BY ItemID ) qo CROSS APPLY ( SELECT TOP 1 TranxDate, TranxSum FROM q qi WHERE qi.ItemID = qo.ItemId ORDER BY TranxDate ) qn )
Note that I use GROUP BY
instead of DISTINCT
in the subquery that selects the ItemID
's. This allows to select the value of MAX(TranxDate)
along with each ItemID
which could be used as a stop condition in the recursive CTE.
Calculating the running SUM
To calculate the running SUM
we need to run the query recursively, adding one day to TranxDate
until it reaches LastDate
(which we selected with the previous query).
The value of the running SUM
will be that of the previous day plus the total amount of transactions for the current day. Unfortunately, due to some limitations of SQL Server's implementation of recursive CTE's, we can use neither LEFT JOIN
nor aggregate functions (like SUM
) in a recursive part of the CTE.
That's why we should use precalculated SUM
's for the current (ItemID, TranxDate)
(which is one of the reasons for pregrouping the records that I mentioned earlier in the article), and retrieve its value using a scalar subquery instead of a LEFT JOIN
:
WITH q AS ( SELECT ItemID, TranxDate, SUM(TranxAmt) AS TranxSum FROM [20100122_running].t_transaction GROUP BY ItemID, TranxDate ), m AS ( SELECT qo.ItemID, LastDate, TranxDate, TranxSum FROM ( SELECT ItemID, MAX(TranxDate) AS LastDate FROM q GROUP BY ItemID ) qo CROSS APPLY ( SELECT TOP 1 TranxDate, TranxSum FROM q qi WHERE qi.ItemID = qo.ItemId ORDER BY TranxDate ) qn UNION ALL SELECT ItemId, LastDate, DATEADD(day, 1, m.TranxDate), m.TranxSum + COALESCE( ( SELECT TranxSum FROM q WHERE q.ItemID = m.ItemId AND q.TranxDate = DATEADD(day, 1, m.TranxDate) ), 0) FROM m WHERE m.TranxDate < m.LastDate )
Since there are neither a LEFT JOIN
nor a SUM
in the recursive part of the query, SQL Server allows this construct.
If we were to SELECT
from this CTE we would get a list of running sums for each ItemID
.
Selecting greatest running sums
All we have to do now is to select records holding a greatest running sum per ItemID
.
This article (which I already mentioned above) describes two methods for doing this in SQL Server.
We already used one of these methods, the one which uses DISTINCT / CROSS APPLY
. For that task it was better since we needed a GROUP BY
anyway to get the LastDate
.
However, for getting the records holding greatest running sums, we will use another method, which just filters on partitioned ROW_NUMBER
. This method is better, since the query in question is quite a complex CTE. SQL Server does not usually materialize the CTE's and reevaluating it each time in a CROSS APPLY
would be quite expensive.
The ROW_NUMBER
, on the contrary, can be retrieved and filtered on with a single pass over the CTE's resultset.
Here's how we do it:
WITH q AS ( SELECT ItemID, TranxDate, SUM(TranxAmt) AS TranxSum FROM [20100122_running].t_transaction GROUP BY ItemID, TranxDate ), m AS ( SELECT qo.ItemID, LastDate, TranxDate, TranxSum FROM ( SELECT ItemID, MAX(TranxDate) AS LastDate FROM q GROUP BY ItemID ) qo CROSS APPLY ( SELECT TOP 1 TranxDate, TranxSum FROM q qi WHERE qi.ItemID = qo.ItemId ORDER BY TranxDate ) qn UNION ALL SELECT ItemId, LastDate, DATEADD(day, 1, m.TranxDate), m.TranxSum + COALESCE( ( SELECT TranxSum FROM q WHERE q.ItemID = m.ItemId AND q.TranxDate = DATEADD(day, 1, m.TranxDate) ), 0) FROM m WHERE m.TranxDate < m.LastDate ) SELECT ItemID, TranxDate, TranxSum FROM ( SELECT m.*, ROW_NUMBER() OVER (PARTITION BY ItemId ORDER BY TranxSum DESC, TranxDate) AS rn FROM m ) qp WHERE rn = 1 OPTION (MAXRECURSION 0)
ItemID | TranxDate | TranxSum | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1999-10-26 00:00:00.000 | 406.0530 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 2007-07-08 00:00:00.000 | 5594.7071 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | 2009-06-04 00:00:00.000 | 4903.4597 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | 1996-10-14 00:00:00.000 | 22.3472 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | 2008-06-18 00:00:00.000 | 247.2859 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0004s (1.0937s) |
Table 'Worktable'. Scan count 2, logical reads 149994, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_transaction'. Scan count 49534, logical reads 151549, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1016 ms, elapsed time = 1101 ms.
|--Filter(WHERE:([Expr1029]=(1))) |--Sequence Project(DEFINE:([Expr1029]=row_number)) |--Compute Scalar(DEFINE:([Expr1041]=(1))) |--Segment |--Sort(ORDER BY:([Recr1025] ASC, [Recr1028] DESC, [Recr1027] ASC)) |--Index Spool(WITH STACK) |--Concatenation |--Compute Scalar(DEFINE:([Expr1036]=(0))) | |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20100122_running].[t_transaction].[ItemID])) | |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[ItemID]) DEFINE:([Expr1004]=MAX([test].[20100122_running].[t_transaction].[TranxDate]))) | | |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[ItemID], [test].[20100122_running].[t_transaction].[TranxDate])) | | |--Index Scan(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), ORDERED FORWARD) | |--Top(TOP EXPRESSION:((1))) | |--Stream Aggregate(GROUP BY:([test].[20100122_running].[t_transaction].[TranxDate]) DEFINE:([Expr1008]=SUM([test].[20100122_running].[t_transaction].[TranxAmt]))) | |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[test].[20100122_running].[t_transaction].[ItemID]) ORDERED FORWARD) |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1038], [Recr1009], [Recr1010], [Recr1011], [Recr1012])) |--Compute Scalar(DEFINE:([Expr1038]=[Expr1037]+(1))) | |--Table Spool(WITH STACK) |--Compute Scalar(DEFINE:([Expr1013]=dateadd(day,(1),[Recr1011]), [Expr1024]=[Recr1012]+CASE WHEN [Expr1017] IS NOT NULL THEN [Expr1022] ELSE ($0.0000) END)) |--Nested Loops(Left Outer Join, PASSTHRU:(IsFalseOrNull [Expr1017] IS NOT NULL)) |--Nested Loops(Left Outer Join) | |--Filter(WHERE:(STARTUP EXPR([Recr1011]<[Recr1010]))) | | |--Constant Scan | |--Stream Aggregate(DEFINE:([Expr1017]=SUM([test].[20100122_running].[t_transaction].[TranxAmt]))) | |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[Recr1009] AND [test].[20100122_running].[t_transaction].[TranxDate]=dateadd(day,(1),[Recr1011])) ORDERED FORWARD) |--Stream Aggregate(DEFINE:([Expr1022]=SUM([test].[20100122_running].[t_transaction].[TranxAmt]))) |--Index Seek(OBJECT:([test].[20100122_running].[t_transaction].[IX_transaction_date__amt]), SEEK:([test].[20100122_running].[t_transaction].[ItemID]=[Recr1009] AND [test].[20100122_running].[t_transaction].[TranxDate]=dateadd(day,(1),[Recr1011])) ORDERED FORWARD)
The query completes in 1 second which is very fast.
It performs much better than a cursor-based solution and of course very much better than SELECT
-level subqueries or self-joins that calculate the same sums of the same rows over and over and over again.
Social comments and analytics for this post…
This post was mentioned on Twitter by MrPaladin: How to achieve acumulating totals in sqlServer #sql http://is.gd/7qkBq…
uberVU - social comments
2 Feb 10 at 21:48