Included columns
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Peter asks:
I've been looking around for explanation of how
INCLUDED
columns in SQL Server indexes work and found myself totally confused.Could you please explain what they are for?
Thanks!
In a couple of words, an INCLUDED
column is a column which is included into the index payload but not the key, i. e. the index leaves are not sorted on this column.
This can save some DML
overhead on columns which are filtered on non-ranged conditions.
If the index leaves are less in size than the table rows, it can be more efficient to get the column value from the index rather than from the table to filter on it.
This can help to get rid of expensive Key Lookups
or Clustered Index Scans
.
In one of applications I developed, I had a large table that kept the restaurant orders.
The table was very large both in row count and the row size: a little less than a hundred columns.
Among other fields there was a DATETIME
column (the order date) and an INT
column (the client id
)
One of the most used reports required that I calculated the average price of the orders for a certain set of clients within a certain period, say, for June.
For the sake of brevity, I'll replace all these numerous columns with a single NVARCHAR(4000)
column.
This will make the table definition more brief and comprehensible, leaving the table rows as large.
Here's what the table looked like:
SET NOCOUNT ON
GO
DROP TABLE [20090626_include].t_order
DROP SCHEMA [20090626_include]
GO
CREATE SCHEMA [20090626_include]
CREATE TABLE t_order (
id INT NOT NULL PRIMARY KEY,
date DATETIME NOT NULL,
stuffing NVARCHAR(4000) NOT NULL,
uclient INT NOT NULL,
iclient INT NOT NULL,
price FLOAT NOT NULL
)
GO
CREATE INDEX ix_order_date__iclient ON [20090626_include].t_order(date) INCLUDE (iclient)
BEGIN TRANSACTION
SELECT RAND(20090626)
DECLARE @cnt INT
DECLARE @client INT
SET @cnt = 1
WHILE @cnt <= 100000
BEGIN
SET @client = FLOOR(RAND() * 100) + 1
INSERT
INTO [20090626_include].t_order
VALUES (
@cnt,
DATEADD(minute, -@cnt, CAST('2009-26-06' AS DATETIME)),
REPLICATE('A', 4000),
@client,
@client,
FLOOR(RAND() * 1000) + 250
)
SET @cnt = @cnt + 1
END
COMMIT
GO
[/sourcecode]
As you can see, the table t_order
has an index on date
with included column iclient
.
The column uclient
is not include into the index.
Values in iclient
and uclient
are always the same.
Now, the query that made the report looked something like this:
SELECT AVG(price)
FROM [20090626_include].t_order
WHERE date >= '2009 Jun 01'
AND date < '2009 Jul 01'
AND uclient IN (10, 20, 30)
[/sourcecode]
756.82067703568157 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (23.3746s) |
Table 't_order'. Scan count 3, logical reads 113292, physical reads 177, read-ahead reads 98054, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 827 ms, elapsed time = 23378 ms.
SELECT Compute Scalar Stream Aggregate Parallelism (Gather Streams) Stream Aggregate Clustered Index Scan ([20090626_include].[t_order])
This runs for more than 23 seconds, which is very long.
We see that there is a Clustered Index Scan
in the execution plan.
Since the table rows are very large, this is a very expensive operation.
However, in current design this is the least of evils: the condition on the date
is not very selective, and the engine needs to look into the table rows to find out the value of uclient
anyway.
But if we replace uclient
with iclient
in the query above, we will get the following results:
SELECT AVG(price)
FROM [20090626_include].t_order
WHERE date >= '2009 Jun 01'
AND date < '2009 Jul 01'
AND iclient IN (10, 20, 30)
[/sourcecode]
756.82067703568157 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.0119s) |
Table 't_order'. Scan count 1, logical reads 3506, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 16 ms, elapsed time = 12 ms.
This yields the same result and runs for only 12 ms: quite a difference.
Why so?
Here's the execution plan for the query:
SELECT Compute Scalar Stream Aggregate Nested Loops (Inner Join) Index Seek ([20090626_include].[t_order].[t_order_date__iclient]) Key Lookup ([20090626_include].[t_order])
The index now is used for filtering the values of iclient
.
This is done not with ranging, but with mere filtering, just like in Clustered Index Scan
in the previous query.
But since the index leaves are so much less in size than the table rows, the filtering can be performed in no time with just a few page logical reads from the memory cache.
Of course we could achieve the same results with a plain composite index on (date, iclient)
.
But this would imply overhead of addional sorting on iclient
, and since we don't range the clients anyway, the INCLUDE
clause is just what we need for this query.
Hope that helps.
I'm always glad to answer the questions regarding database queries.