EXPLAIN EXTENDED

How to create fast database queries

What’s UNPIVOT good for?


Answering questions asked on the site.

Karen asks:

… I've always thought PIVOT and UNPIVOT are signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?

I've actually found a use for them in a project I've been working on for the last several months (which is partly why there were no updates for so long!)

Part of the project is a task management system where each task has several persons related to it: the creator of the task, the person the task is assigned to, the actual author of the task (on behalf of whom the task was created), and possibly the person who completed the task and the person who deleted it. A total of five person-related fields.

Now, we need to take all tasks within a certain time range and list all people involved in them.

Let's create a sample table and see how we would do that.

Table creation details

There are five person fields, a timestamp, and some stuffing. The timestamp field is indexed.
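
A minimal sketch of what the table might look like. The table, schema, and index names match the execution plans below; the column types and the stuffing size are my assumptions:

CREATE TABLE [20110630_unpivot].Task
        (
        id INT NOT NULL PRIMARY KEY,    -- becomes the clustered PK seen in the plans below
        createdBy INT NOT NULL,
        assignedTo INT NOT NULL,
        onBehalfOf INT NOT NULL,
        completedBy INT NULL,           -- NULL until the task is completed
        deletedBy INT NULL,             -- NULL unless the task is deleted
        ts DATETIME NOT NULL,
        stuffing VARCHAR(200) NOT NULL  -- filler to give the rows realistic width
        )

CREATE INDEX IX_Task_Ts ON [20110630_unpivot].Task (ts)

The table is then filled with random data; the exact volume is not essential, as long as the range we query returns tens of thousands of rows.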

Now, let's find all people involved in the tasks between 2011-06-30 and 2011-06-30 04:00:00. To do this, we could just use five queries (each selecting one of the person fields) and UNION them:

SELECT  SUM(CAST(id AS BIGINT)), COUNT(*)
FROM    (
        SELECT  createdBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        UNION
        SELECT  assignedTo AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        UNION
        SELECT  onBehalfOf AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        UNION
        SELECT  completedBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
                AND completedBy IS NOT NULL
        UNION
        SELECT  deletedBy AS id
        FROM    [20110630_unpivot].Task
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
                AND deletedBy IS NOT NULL
        ) q
2573160101 51331
1 row fetched in 0.0009s (0.4251s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 15, logical reads 220765, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 484 ms,  elapsed time = 410 ms. 
  |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [globalagg1019]=(0) THEN NULL ELSE [globalagg1021] END, [Expr1017]=CONVERT_IMPLICIT(int,[globalagg1023],0)))
       |--Stream Aggregate(DEFINE:([globalagg1019]=SUM([partialagg1018]), [globalagg1021]=SUM([partialagg1020]), [globalagg1023]=SUM([partialagg1022])))
            |--Parallelism(Gather Streams)
                 |--Compute Scalar(DEFINE:([partialagg1022]=[partialagg1018]))
                      |--Stream Aggregate(DEFINE:([partialagg1018]=Count(*), [partialagg1020]=SUM(CONVERT(bigint,[Union1015],0))))
                           |--Hash Match(Aggregate, HASH:([Union1015]), RESIDUAL:([Union1015] = [Union1015]))
                                |--Concatenation
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[createdBy]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1025]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[assignedTo]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1026]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[onBehalfOf]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1027]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[completedBy]))
                                     |    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1028]) OPTIMIZED WITH UNORDERED PREFETCH)
                                     |         |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                     |         |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]),  WHERE:([ee].[20110630_unpivot].[Task].[completedBy] IS NOT NULL) LOOKUP ORDERED FORWARD)
                                     |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([ee].[20110630_unpivot].[Task].[deletedBy]))
                                          |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1029]) OPTIMIZED WITH UNORDERED PREFETCH)
                                               |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                               |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F0B3E4B07]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]),  WHERE:([ee].[20110630_unpivot].[Task].[deletedBy] IS NOT NULL) LOOKUP ORDERED FORWARD)

Now we have 51331 persons returned in 410 ms, which required 220765 logical reads.

As we can see in the plan, the table is scanned 5 times (once for each query). This is not very efficient, of course. It would be much better if each matching record in the table were visited only once.

This is where UNPIVOT comes into play.

As its name suggests, it does the reverse of PIVOT: it moves data from multiple columns into multiple rows. And this is exactly what we need in our case.
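
Before applying it to our table, here is a minimal self-contained illustration (the row and its values are made up):

SELECT  id, personType, personId
FROM    (
        VALUES  (1, 101, 102, CAST(NULL AS INT))
        ) t (id, createdBy, assignedTo, completedBy)
UNPIVOT
(
personId FOR personType IN (createdBy, assignedTo, completedBy)
) u

id          personType   personId
----------- ------------ -----------
1           createdBy    101
1           assignedTo   102

The three person columns become rows, and the NULL in completedBy is silently dropped: UNPIVOT skips NULLs, which is also why the query below will not need the explicit IS NOT NULL checks we had to add to the UNION version.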

Let's try it:

SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        UNPIVOT
        (
        personId FOR personType IN
        (createdBy, assignedTo, onBehalfOf, completedBy, deletedBy)
        ) p
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
        ) q

2573160101 51331
1 row fetched in 0.0002s (0.1802s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 3, logical reads 44153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 165 ms,  elapsed time = 180 ms. 
  |--Compute Scalar(DEFINE:([Expr1013]=CASE WHEN [globalagg1016]=(0) THEN NULL ELSE [globalagg1018] END, [Expr1014]=CONVERT_IMPLICIT(int,[globalagg1020],0)))
       |--Stream Aggregate(DEFINE:([globalagg1016]=SUM([partialagg1015]), [globalagg1018]=SUM([partialagg1017]), [globalagg1020]=SUM([partialagg1019])))
            |--Parallelism(Gather Streams)
                 |--Compute Scalar(DEFINE:([partialagg1019]=[partialagg1015]))
                      |--Stream Aggregate(DEFINE:([partialagg1015]=Count(*), [partialagg1017]=SUM(CONVERT(bigint,[Expr1011],0))))
                           |--Sort(DISTINCT ORDER BY:([Expr1011] ASC))
                                |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
                                     |--Hash Match(Partial Aggregate, HASH:([Expr1011]), RESIDUAL:([Expr1011] = [Expr1011]))
                                          |--Filter(WHERE:([Expr1011] IS NOT NULL))
                                               |--Nested Loops(Left Outer Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[createdBy], [ee].[20110630_unpivot].[Task].[assignedTo], [ee].[20110630_unpivot].[Task].[onBehalfOf], [ee].[20110630_unpivot].[Task].[completedBy], [ee].[20110630_unpivot].[Task].[deletedBy]))
                                                    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1022]) OPTIMIZED WITH UNORDERED PREFETCH)
                                                    |    |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                                    |    |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F14C7B541]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                                    |--Constant Scan(VALUES:(([ee].[20110630_unpivot].[Task].[createdBy]),([ee].[20110630_unpivot].[Task].[assignedTo]),([ee].[20110630_unpivot].[Task].[onBehalfOf]),([ee].[20110630_unpivot].[Task].[completedBy]),([ee].[20110630_unpivot].[Task].[deletedBy])))

As we can see, this is more than twice as fast as the UNION and takes only 44153 logical reads to complete: exactly one fifth of the 220765 reads the UNION version needed, since each matching record is now read once instead of five times.

How does it work internally?

We see in the plan that there is a nested loop between the result of a clustered index seek and something called a Constant Scan. The Constant Scan returns 5 values in each loop iteration: the fields listed in the UNPIVOT clause. It takes each record and outputs the fields found in it, without rereading the record. This is exactly what we wanted.

This behavior becomes clearer if we rewrite the query a little:

SELECT  SUM(CAST(personId AS BIGINT)), COUNT(*)
FROM    (
        SELECT  DISTINCT personId
        FROM    [20110630_unpivot].Task
        CROSS APPLY
        (
        SELECT  createdBy AS personId
        UNION ALL
        SELECT  assignedTo AS personId
        UNION ALL
        SELECT  onBehalfOf AS personId
        UNION ALL
        SELECT  completedBy AS personId
        UNION ALL
        SELECT  deletedBy AS personId
        ) p
        WHERE   ts >= '2011-06-30'
                AND ts < '2011-06-30 04:00:00'
                AND personId IS NOT NULL
        ) q
2573160101 51331
1 row fetched in 0.0002s (0.1766s)
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Task'. Scan count 3, logical reads 44153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
[Microsoft][SQL Server Native Client 10.0][SQL Server]
 SQL Server Execution Times:
   CPU time = 149 ms,  elapsed time = 176 ms. 
  |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [globalagg1007]=(0) THEN NULL ELSE [globalagg1009] END, [Expr1005]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
       |--Stream Aggregate(DEFINE:([globalagg1007]=SUM([partialagg1006]), [globalagg1009]=SUM([partialagg1008]), [globalagg1011]=SUM([partialagg1010])))
            |--Parallelism(Gather Streams)
                 |--Compute Scalar(DEFINE:([partialagg1010]=[partialagg1006]))
                      |--Stream Aggregate(DEFINE:([partialagg1006]=Count(*), [partialagg1008]=SUM(CONVERT(bigint,[Union1003],0))))
                           |--Sort(DISTINCT ORDER BY:([Union1003] ASC))
                                |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Union1003]))
                                     |--Hash Match(Partial Aggregate, HASH:([Union1003]), RESIDUAL:([Union1003] = [Union1003]))
                                          |--Filter(WHERE:([Union1003] IS NOT NULL))
                                               |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[createdBy], [ee].[20110630_unpivot].[Task].[assignedTo], [ee].[20110630_unpivot].[Task].[onBehalfOf], [ee].[20110630_unpivot].[Task].[completedBy], [ee].[20110630_unpivot].[Task].[deletedBy]))
                                                    |--Nested Loops(Inner Join, OUTER REFERENCES:([ee].[20110630_unpivot].[Task].[id], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                                                    |    |--Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[IX_Task_Ts]), SEEK:([ee].[20110630_unpivot].[Task].[ts] >= '2011-06-30 00:00:00.000' AND [ee].[20110630_unpivot].[Task].[ts] < '2011-06-30 04:00:00.000') ORDERED FORWARD)
                                                    |    |--Clustered Index Seek(OBJECT:([ee].[20110630_unpivot].[Task].[PK__Task__3213E83F14C7B541]), SEEK:([ee].[20110630_unpivot].[Task].[id]=[ee].[20110630_unpivot].[Task].[id]) LOOKUP ORDERED FORWARD)
                                                    |--Constant Scan(VALUES:(([ee].[20110630_unpivot].[Task].[createdBy]),([ee].[20110630_unpivot].[Task].[assignedTo]),([ee].[20110630_unpivot].[Task].[onBehalfOf]),([ee].[20110630_unpivot].[Task].[completedBy]),([ee].[20110630_unpivot].[Task].[deletedBy])))

Here, we just take each record and explode it into 5 records using CROSS APPLY.

This yields exactly the same plan, exactly the same I/O and exactly the same output. In fact, that's exactly what the UNPIVOT query does.

Hope that helps.


I'm always glad to answer questions regarding database queries.

Ask me a question

Written by Quassnoi

June 30th, 2011 at 11:00 pm

Posted in SQL Server

5 Responses to 'What’s UNPIVOT good for?'


  1. But surely this IS a badly designed schema. What if they decide to add more activities to the process? It will require structural alterations – indicative of poor design!?!

    La

    17 Sep 11 at 12:49

  2. @LA: What if they don’t? What if they looked into their model, put some thought into it, and decided this was the best way to store their data?

    Quassnoi

    18 Sep 11 at 00:59

  3. @Quassnoi: I don’t get it. In what sense can an inflexible design be presented as a ‘better way to store data’?

    So now we present them with this hypothetical ‘decision’: what are the arguments that would lead them to your conclusion?

    BTW, I’m an avid reader of your blog – always clear, concise and thorough (not to mention clever) – just wish it was more frequent!

    La

    25 Sep 11 at 16:54

  4. In what sense can an inflexible design be presented as a ‘better way to store data’?

    It’s faster: you can retrieve all 5 values from a single record. That’s more important than hypothetical flexibility. And nothing prevents us from storing possible additional fields in an EAV table should the need arise (which is very unlikely).
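
    Something like this, hypothetically (all names and types made up):

    CREATE TABLE TaskExtraPerson
            (
            taskId INT NOT NULL REFERENCES Task (id),
            personType VARCHAR(30) NOT NULL, -- e.g. 'reviewedBy'
            personId INT NOT NULL,
            PRIMARY KEY (taskId, personType)
            )

    The five common roles stay in Task, one fast record wide; only the rare extra roles would pay the price of the join.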

    BTW, I’m an avid reader of your blog – always clear, concise and thorough (not to mention clever) – just wish it was more frequent!

    I too!

    Quassnoi

    25 Sep 11 at 17:08

  5. … I’ve always thought PIVOT and UNPIVOT are signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?

    Hmm. I love pivots, and in one job used them all the time.

    There have been many times where I’ve had to use pivot or even unpivot to show or import data.

    Take Hourly values for example:
    Schema: {PKey, DateTime, Double}

    For 24 hours of hourly data, that’s 24 rows.
    For 24 hours of 5-minute data, that’s 288 rows.

    Now one user wants to see an average aggregate of the 5-minute data by hour, with the hour as the column header, NOT the row key.
    {Hour1, Hour2, Hour3}
    { 120, 200, 300}

    Another user wants to see it with each hour as a Row…
    {HE, Value}
    {1, 120}
    {2, 200}
    {3, 300}

    Easy to pivot.
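
    For instance (a sketch; the table and column names are hypothetical):

    SELECT  [1] AS Hour1, [2] AS Hour2, [3] AS Hour3
    FROM    (
            -- hour-ending bucket for each 5-minute reading
            SELECT  DATEPART(HOUR, readingTime) + 1 AS he, value
            FROM    Readings
            ) r
    PIVOT
    (
    AVG(value) FOR he IN ([1], [2], [3])
    ) p

    The second shape is just the inner query aggregated with GROUP BY he instead of pivoted.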

    I don’t see pivots as bad DB design at all.

    Nona

    6 Oct 11 at 00:01
