What’s UNPIVOT good for?
Answering questions asked on the site.
Karen asks:
… I've always thought PIVOT and UNPIVOT are signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?
I've found an actual use for them in a project I've been working on for the last several months (which is partly why there were no updates for so long!).
Part of the project is a task management system where each task has several persons related to it: the creator of the task, the person the task is assigned to, the actual author of the task (on behalf of whom the task is created), and, possibly, the person who completed the task and the person who deleted it. A total of 5 person fields.
Now, we need to take all tasks within a certain time range and list all people involved in them.
Let's create a sample table and see how we would do that.
There are 5 person fields, a timestamp, and some stuffing. The timestamp field is indexed.
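The actual DDL for the sample table is not shown in this excerpt; here is a minimal sketch of what it might look like, reconstructed from the column names and the plans below (types, sizes, and the stuffing definition are assumptions):

```sql
-- Hypothetical reconstruction: the post's actual DDL is not shown here.
CREATE TABLE [20110630_unpivot].Task
        (
        id INT NOT NULL PRIMARY KEY,    -- clustered PK (used for the bookmark lookups in the plans)
        createdBy INT NOT NULL,
        assignedTo INT NOT NULL,
        onBehalfOf INT NOT NULL,
        completedBy INT,                -- NULL until the task is completed
        deletedBy INT,                  -- NULL unless the task is deleted
        ts DATETIME NOT NULL,
        stuffing VARCHAR(200)           -- filler to give the rows realistic width
        )

CREATE INDEX IX_Task_Ts ON [20110630_unpivot].Task (ts)  -- supports the range filter on ts
```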
Now, let's find all people involved in the tasks between 2011-06-30 and 2011-06-30 04:00:00. To do this, we could just use 5 queries (each selecting one of the person fields) and UNION them:
SELECT  SUM(CAST(id AS BIGINT)), COUNT(*)
FROM    (
        SELECT  createdBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        UNION
        SELECT  assignedTo AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        UNION
        SELECT  onBehalfOf AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        UNION
        SELECT  completedBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
                AND completedBy IS NOT NULL
        UNION
        SELECT  deletedBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
                AND deletedBy IS NOT NULL
        ) q
2573160101 | 51331
1 row fetched in 0.0009s (0.4251s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 15, logical reads 220765, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 484 ms, elapsed time = 410 ms.
|--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [globalagg1019]=(0) THEN NULL ELSE [globalagg1021] END, [Expr1017]=CONVERT_IMPLICIT(int,[globalagg1023],0)))
  |--Stream Aggregate(DEFINE:([globalagg1019]=SUM([partialagg1018]), [globalagg1021]=SUM([partialagg1020]), [globalagg1023]=SUM([partialagg1022])))
    |--Parallelism(Gather Streams)
      |--Compute Scalar(DEFINE:([partialagg1022]=[partialagg1018]))
        |--Stream Aggregate(DEFINE:([partialagg1018]=Count(*), [partialagg1020]=SUM(CONVERT(bigint,[Union1015],0))))
          |--Hash Match(Aggregate, HASH:([Union1015]), RESIDUAL:([Union1015] = [Union1015]))
            |--Concatenation
              |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[createdBy]))
              |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1025]) OPTIMIZED WITH UNORDERED PREFETCH)
              |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
              |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
              |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[assignedTo]))
              |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1026]) OPTIMIZED WITH UNORDERED PREFETCH)
              |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
              |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
              |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[onBehalfOf]))
              |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1027]) OPTIMIZED WITH UNORDERED PREFETCH)
              |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
              |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
              |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[completedBy]))
              |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1028]) OPTIMIZED WITH UNORDERED PREFETCH)
              |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
              |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]), WHERE:([ee].[20110630_unpivot].[Task].[completedBy] IS NOT NULL) LOOKUP ORDERED FORWARD)
              |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[deletedBy]))
                   |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1029]) OPTIMIZED WITH UNORDERED PREFETCH)
                        |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                        |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]), WHERE:([ee].[20110630_unpivot].[Task].[deletedBy] IS NOT NULL) LOOKUP ORDERED FORWARD)
Now we have 51331 persons returned in 410 ms, which required 220765 logical reads.
As we can see in the plan, the table is scanned 5 times (once for each query). This is not very efficient, of course. It would be much better if each matching record in the table were visited only once.
This is where UNPIVOT comes into play.
As its name suggests, it does the reverse of PIVOT: it moves data from multiple columns into multiple rows. And this is exactly what we need in our case.
Let's try it:
SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        UNPIVOT (
                personId
                FOR personType IN (createdBy, assignedTo, onBehalfOf, completedBy, deletedBy)
                ) p
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        ) q
2573160101 | 51331
1 row fetched in 0.0002s (0.1802s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Task'. Scan count 3, logical reads 44153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 165 ms, elapsed time = 180 ms.
|--Compute Scalar(DEFINE:([Expr1013]=CASE WHEN [globalagg1016]=(0) THEN NULL ELSE [globalagg1018] END, [Expr1014]=CONVERT_IMPLICIT(int,[globalagg1020],0)))
  |--Stream Aggregate(DEFINE:([globalagg1016]=SUM([partialagg1015]), [globalagg1018]=SUM([partialagg1017]), [globalagg1020]=SUM([partialagg1019])))
    |--Parallelism(Gather Streams)
      |--Compute Scalar(DEFINE:([partialagg1019]=[partialagg1015]))
        |--Stream Aggregate(DEFINE:([partialagg1015]=Count(*), [partialagg1017]=SUM(CONVERT(bigint,[Expr1011],0))))
          |--Sort(DISTINCT ORDER BY:([Expr1011] ASC))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
              |--Hash Match(Partial Aggregate, HASH:([Expr1011]), RESIDUAL:([Expr1011] = [Expr1011]))
                |--Filter(WHERE:([Expr1011] IS NOT NULL))
                  |--Nested Loops(Left Outer Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[createdBy], [ee].[20110630_unpivot].[Task].[assignedTo], [ee].[20110630_unpivot].[Task].[onBehalfOf], [ee].[20110630_unpivot].[Task].[completedBy], [ee].[20110630_unpivot].[Task].[deletedBy]))
                    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1022]) OPTIMIZED WITH UNORDERED PREFETCH)
                    |    |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                    |    |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F14C7B541]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                    |--Constant Scan(VALUES:(([ee].[20110630_unpivot].[Task].[createdBy]),([ee].[20110630_unpivot].[Task].[assignedTo]),([ee].[20110630_unpivot].[Task].[onBehalfOf]),([ee].[20110630_unpivot].[Task].[completedBy]),([ee].[20110630_unpivot].[Task].[deletedBy])))
As we can see, this is more than twice as fast as the UNION and takes only 44153 logical reads to complete.
How does it work internally?
We see in the plan that there is a nested loop between the result of a clustered index seek and something called a Constant Scan. The constant scan returns 5 values on each iteration of the loop, and those are the fields listed in the UNPIVOT clause. It just takes each record and outputs the fields found there, without rereading the record. This is exactly what we wanted.
This behavior can be made clearer if we rewrite the query a little:
SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        CROSS APPLY
                (
                SELECT  createdBy AS personId
                UNION ALL
                SELECT  assignedTo AS personId
                UNION ALL
                SELECT  onBehalfOf AS personId
                UNION ALL
                SELECT  completedBy AS personId
                UNION ALL
                SELECT  deletedBy AS personId
                ) p
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
                AND personId IS NOT NULL
        ) q
2573160101 | 51331
1 row fetched in 0.0002s (0.1766s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Task'. Scan count 3, logical reads 44153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 149 ms, elapsed time = 176 ms.
|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [globalagg1007]=(0) THEN NULL ELSE [globalagg1009] END, [Expr1005]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
  |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
    |--Parallelism(Gather Streams)
      |--Compute Scalar(DEFINE:([partialagg1010]=[partialagg1006]))
        |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=SUM(CONVERT(bigint,[Union1003],0))))
          |--Sort(DISTINCT ORDER BY:([Union1003] ASC))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Union1003]))
              |--Hash Match(Partial Aggregate, HASH:([Union1003]), RESIDUAL:([Union1003] = [Union1003]))
                |--Filter(WHERE:([Union1003] IS NOT NULL))
                  |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[createdBy], [ee].[20110630_unpivot].[Task].[assignedTo], [ee].[20110630_unpivot].[Task].[onBehalfOf], [ee].[20110630_unpivot].[Task].[completedBy], [ee].[20110630_unpivot].[Task].[deletedBy]))
                    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                    |    |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                    |    |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F14C7B541]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                    |--Constant Scan(VALUES:(([ee].[20110630_unpivot].[Task].[createdBy]),([ee].[20110630_unpivot].[Task].[assignedTo]),([ee].[20110630_unpivot].[Task].[onBehalfOf]),([ee].[20110630_unpivot].[Task].[completedBy]),([ee].[20110630_unpivot].[Task].[deletedBy])))
Here, we just take each record and explode it into 5 records using CROSS APPLY.
This yields exactly the same plan, exactly the same I/O, and exactly the same output. In fact, that's exactly what the UNPIVOT query does.
Hope that helps.
I'm always glad to answer questions regarding database queries.
But surely this IS a badly designed schema. What if they decide to add more activities to the process? It will require structural alterations – indicative of poor design!?!
LA
La
17 Sep 11 at 12:49
@LA: what if they don’t? What if they looked into their model, put some thought into it, and decided it would be the best way to store their data?
Quassnoi
18 Sep 11 at 00:59
@Quassnoi: I don’t get it. In what sense can an inflexible design be presented as a ‘better way to store data’?
So now we present them with this hypothetical ‘decision’, what are the arguments that would lead them to your conclusion?
BTW, I’m an avid reader of your blog – always clear, concise and thorough (not to mention clever) – just wish it was more frequent!
La
25 Sep 11 at 16:54
In what sense can an inflexible design be presented as a ‘better way to store data’?
It’s faster: you can retrieve all 5 values from a single record. That’s more important than hypothetical flexibility. And nothing prevents us from storing possible additional fields in an EAV table should the need arise (which is very unlikely).
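For illustration, the EAV fallback mentioned here could look something like this (the table and column names are hypothetical, not from the post):

```sql
-- Hypothetical EAV side table for rare, future person roles;
-- the main Task table keeps its five fixed columns.
CREATE TABLE [20110630_unpivot].TaskPersonExtra
        (
        taskId INT NOT NULL,                -- references Task.id
        personRole VARCHAR(30) NOT NULL,    -- e.g. 'reviewedBy': a new role needs no ALTER TABLE
        personId INT NOT NULL,
        PRIMARY KEY (taskId, personRole)
        )
```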
BTW, I’m an avid reader of your blog – always clear, concise and thorough (not to mention clever) – just wish it was more frequent!
I too!
Quassnoi
25 Sep 11 at 17:08
… I’ve always thought PIVOT and UNPIVOT are signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?
Hmm. I love pivots, and in one job used them all the time.
There have been many times where I’ve had to use PIVOT or even UNPIVOT to show or import data.
Take Hourly values for example:
Schema: {PKey, DateTime, Double}
For 24 hours of hourly data, that’s 24 rows.
For 24 hours of 5-minute data, that’s 288 rows.
Now one user wants to see an average aggregate of the 5-minute data by hour, with the hour as the column header, NOT the row key.
{Hour1, Hour2, Hour3}
{ 120, 200, 300}
Another user wants to see it with each hour as a Row…
{HE, Value}
{1, 120}
{2, 200}
{3, 300}
Easy to pivot.
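The hour-per-column report described above could be written with PIVOT roughly like this (the table and column names are assumed, and only three hours are listed for brevity):

```sql
-- Hypothetical query for the {PKey, DateTime, Double} schema above:
-- average the 5-minute readings per hour and turn the hours into columns.
SELECT  [1] AS Hour1, [2] AS Hour2, [3] AS Hour3
FROM    (
        SELECT  DATEPART(HOUR, readingTime) + 1 AS he,  -- "hour ending", 1..24
                value
        FROM    Readings
        ) r
PIVOT   (
        AVG(value)
        FOR he IN ([1], [2], [3])       -- extend the list for the remaining hours
        ) p
```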
I don’t see pivots as bad DB design at all.
Nona
6 Oct 11 at 00:01