EXPLAIN EXTENDED

How to create fast database queries

Included columns

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Peter asks:

I've been looking around for explanation of how INCLUDED columns in SQL Server indexes work and found myself totally confused.

Could you please explain what they are for?

Thanks!

In a couple of words, an INCLUDED column is a column which is included into the index payload but not the key, i. e. the index leaves are not sorted on this column.

This can save some DML overhead on columns which are filtered on non-ranged conditions.

If the index leaves are less in size than the table rows, it can be more efficient to get the column value from the index rather than from the table to filter on it.

This can help to get rid of expensive Key Lookups or Clustered Index Scans.

In one of applications I developed, I had a large table that kept the restaurant orders.

The table was very large both in row count and the row size: a little less than a hundred columns.

Among other fields there was a DATETIME column (the order date) and an INT column (the client id)

One of the most used reports required that I calculated the average price of the orders for a certain set of clients within a certain period, say, for June.

For the sake of brevity, I'll replace all these numerous columns with a single NVARCHAR(4000) column.

This will make the table definition more brief and comprehensible, leaving the table rows as large.

Here's what the table looked like:

SET NOCOUNT ON
GO
DROP TABLE [20090626_include].t_order
DROP SCHEMA [20090626_include]
GO
CREATE SCHEMA [20090626_include]
CREATE TABLE t_order (
id INT NOT NULL PRIMARY KEY,
date DATETIME NOT NULL,
stuffing NVARCHAR(4000) NOT NULL,
uclient INT NOT NULL,
iclient INT NOT NULL,
price FLOAT NOT NULL
)
GO
CREATE INDEX ix_order_date__iclient ON [20090626_include].t_order(date) INCLUDE (iclient)
BEGIN TRANSACTION
SELECT RAND(20090626)
DECLARE @cnt INT
DECLARE @client INT
SET @cnt = 1
WHILE @cnt <= 100000 BEGIN SET @client = FLOOR(RAND() * 100) + 1 INSERT INTO [20090626_include].t_order VALUES ( @cnt, DATEADD(minute, -@cnt, CAST('2009-26-06' AS DATETIME)), REPLICATE('A', 4000), @client, @client, FLOOR(RAND() * 1000) + 250 ) SET @cnt = @cnt + 1 END COMMIT GO [/sourcecode] As you can see, the table t_order has an index on date with included column iclient.

The column uclient is not include into the index.

Values in iclient and uclient are always the same.

Now, the query that made the report looked something like this:

SELECT AVG(price)
FROM [20090626_include].t_order
WHERE date >= '2009 Jun 01'
AND date < '2009 Jul 01' AND uclient IN (10, 20, 30) [/sourcecode]

756.82067703568157
1 row fetched in 0.0002s (23.3746s)
Table 't_order'. Scan count 3, logical reads 113292, physical reads 177, read-ahead reads 98054, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 827 ms,  elapsed time = 23378 ms. 
SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Clustered Index Scan ([20090626_include].[t_order])

This runs for more than 23 seconds, which is very long.

We see that there is a Clustered Index Scan in the execution plan.

Since the table rows are very large, this is a very expensive operation.

However, in current design this is the least of evils: the condition on the date is not very selective, and the engine needs to look into the table rows to find out the value of uclient anyway.

But if we replace uclient with iclient in the query above, we will get the following results:

SELECT AVG(price)
FROM [20090626_include].t_order
WHERE date >= '2009 Jun 01'
AND date < '2009 Jul 01' AND iclient IN (10, 20, 30) [/sourcecode]

756.82067703568157
1 row fetched in 0.0002s (0.0119s)
Table 't_order'. Scan count 1, logical reads 3506, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 12 ms. 

This yields the same result and runs for only 12 ms: quite a difference.

Why so?

Here's the execution plan for the query:

SELECT
  Compute Scalar
    Stream Aggregate
      Nested Loops (Inner Join)
        Index Seek ([20090626_include].[t_order].[t_order_date__iclient])
        Key Lookup ([20090626_include].[t_order])

The index now is used for filtering the values of iclient.

This is done not with ranging, but with mere filtering, just like in Clustered Index Scan in the previous query.

But since the index leaves are so much less in size than the table rows, the filtering can be performed in no time with just a few page logical reads from the memory cache.

Of course we could achieve the same results with a plain composite index on (date, iclient).

But this would imply overhead of addional sorting on iclient, and since we don't range the clients anyway, the INCLUDE clause is just what we need for this query.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

June 26th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply