Constraints and the optimizer in SQL Server: PRIMARY KEY and UNIQUE at EXPLAIN EXTENDED

Constraints and the optimizer in SQL Server: PRIMARY KEY and UNIQUE

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Mitch asks:

I was wondering: do the constraints declared on tables in SQL Server affect the decisions made by the optimizer?

SQL Server allows to define the following constraints on the columns:

PRIMARY KEY
UNIQUE
FOREIGN KEY
CHECK
DEFAULT

Since these constraints imply some restrictions on data, it gives a theoretical possibility to use these restrictions while optimizing the queries so that some extra checks (which had already been made by the constraints) can be skipped.

SQL Server does use these possibilities and optimizes its plans taking the constraints into account.

In this article we will see how SQL Server uses PRIMARY KEY and UNIQUE to optimize the plans.

PRIMARY KEY and UNIQUE

Let's create a pair of sample tables:

Table creation details

CREATE SCHEMA [20091014_constraint]
CREATE TABLE t_primary (
        id INT NOT NULL,
        name VARCHAR(30) NOT NULL,
        CONSTRAINT PK_primary_id PRIMARY KEY (id)
)
CREATE TABLE t_foreign (
        id INT NOT NULL,
        pid INT NOT NULL,
        uval INT NOT NULL,
        dval INT NOT NULL,
        name VARCHAR(30) NOT NULL,
        CONSTRAINT PK_foreign_id PRIMARY KEY (id),
        CONSTRAINT FK_foreign_pid_primary FOREIGN KEY (pid) REFERENCES t_primary,
        CONSTRAINT UQ_foreign_uval UNIQUE (uval),
        CONSTRAINT CK_foreign_uval_dval CHECK (uval = dval)
)
GO
CREATE INDEX IX_foreign_dval ON [20091014_constraint].t_foreign (dval)
GO
BEGIN TRANSACTION
SELECT  RAND(20091014)
DECLARE @cnt INT
DECLARE @ld INT
DECLARE @r INT
SET @cnt = 1
WHILE @cnt <= 100000
BEGIN
        INSERT
        INTO    &#91;20091014_constraint&#93;.t_primary
        VALUES  (@cnt, 'Primary ' + CAST(@cnt AS VARCHAR))
        SET @ld = (@cnt - 1) * 10 + 1
        WHILE @ld <= @cnt * 10
        BEGIN
                SET @r = @ld * 100 + FLOOR(RAND() * 100)
                INSERT
                INTO    &#91;20091014_constraint&#93;.t_foreign
                VALUES  (
                        @ld,
                        @cnt,
                        @r,
                        @r,
                        'Foreign ' + CAST(@ld AS VARCHAR)
                        )
                SET @ld = @ld + 1
        END
        SET @cnt = @cnt + 1
END
COMMIT
GO
&#91;/sourcecode&#93;
</div>

<code>t_primary</code> has <strong>100,000</strong> records, <code>t_foreign</code> has <strong>1,000,000</strong> records.

There are a <code>PRIMARY KEY</code>, a <code>UNIQUE</code>, a <code>FOREIGN KEY</code> and a <code>CHECK</code> defined on <code>t_foreign</code>.

<code>PRIMARY KEY</code> and <code>UNIQUE</code> are very similar in <strong>SQL Server</strong>. There are only two major differences:

<ul>
	<li>There can be only one <code>PRIMARY KEY</code> in a table but multiple <code>UNIQUE</code> constraints are possible</li>
	<li>A <code>PRIMARY KEY</code> cannot accept <code>NULL</code> values while a <code>UNIQUE</code> column can. Unlike other systems, <code>NULL</code> count as distinct values, there can be only one <code>NULL</code> in a unique column.</li>
</ul>

<strong>SQL Server</strong>'s optimizer can use these constraints to build more optimal plans. Since the values of the constrained fields are guaranteed to be <code>UNQIUE</code>, there are at least two optimizations that <strong>SQL Server</strong> does:

<ul>
	<li>If there is a <code>UNIQUE</code> column in the <code>SELECT</code> clause, <code>DISTINCT</code> is ignored</li>
	<li>If there is a sort on a <code>UNIQUE</code> column, all subsequent sort expressions are ignored</li>
</ul>

<h4>DISTINCT</h4>

Let's run two identical queries on <code>uval</code> and <code>dval</code> in <code>t_foreign</code>:


SELECT  SUM(LEN(name)) AS usum
FROM    (
        SELECT  DISTINCT uval, name
        FROM    [20091014_constraint].t_foreign
        ) q

View query details

usum
13888896
1 row fetched in 0.0002s (0.3641s)

Table 't_foreign'. Scan count 3, logical reads 7744, physical reads 0, read-ahead reads 409, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 407 ms,  elapsed time = 359 ms.

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [globalagg1006]=(0) THEN NULL ELSE [globalagg1008] END))
       |--Stream Aggregate(DEFINE:([globalagg1006]=SUM([partialagg1005]), [globalagg1008]=SUM([partialagg1007])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1005]=COUNT_BIG([Expr1004]), [partialagg1007]=SUM([Expr1004])))
                      |--Compute Scalar(DEFINE:([Expr1004]=len([test].[20091014_constraint].[t_foreign].[name])))
                           |--Clustered Index Scan(OBJECT:([test].[20091014_constraint].[t_foreign].[PK_foreign_id]))

This query uses DISTINCT on a column marked as UNIQUE. Since the DISTINCT is redundant here, it is ignored and the query uses a single Stream Aggregate to calculate the resulting aggregate function.

This query completes in 0.35 seconds.

Now, let's run the same query on a non-unique field:

SELECT  SUM(LEN(name)) AS dsum
FROM    (
        SELECT  DISTINCT dval, name
        FROM    [20091014_constraint].t_foreign
        ) q

View query details

dsum
13888896
1 row fetched in 0.0002s (4.5311s)

Table 't_foreign'. Scan count 3, logical reads 7744, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 3563 ms,  elapsed time = 4516 ms.

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [globalagg1006]=(0) THEN NULL ELSE [globalagg1008] END))
       |--Stream Aggregate(DEFINE:([globalagg1006]=SUM([partialagg1005]), [globalagg1008]=SUM([partialagg1007])))
            |--Parallelism(Gather Streams)
                 |--Stream Aggregate(DEFINE:([partialagg1005]=COUNT_BIG([Expr1004]), [partialagg1007]=SUM([Expr1004])))
                      |--Compute Scalar(DEFINE:([Expr1004]=len([test].[20091014_constraint].[t_foreign].[name])))
                           |--Hash Match(Aggregate, HASH:([test].[20091014_constraint].[t_foreign].[dval], [test].[20091014_constraint].[t_foreign].[name]), RESIDUAL:([test].[20091014_constraint].[t_foreign].[dval] = [test].[20091014_constraint].[t_foreign].[dval] AND [test].[20091014_constraint].[t_foreign].[name] = [test].[20091014_constraint].[t_foreign].[name]))
                                |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([test].[20091014_constraint].[t_foreign].[dval], [test].[20091014_constraint].[t_foreign].[name]))
                                     |--Clustered Index Scan(OBJECT:([test].[20091014_constraint].[t_foreign].[PK_foreign_id]))

Now, there is no more UNIQUE constraint on a column and the engine has to do the DISTINCT. To do this it uses an additional Hash Match which of course doesn't add to performance.

The query now runs for 4.5 seconds, or 10 times as slow.

ORDER BY

Again, two identical queries. The first one sorts on uval, dval, the second one sorts on dval, uval.

SELECT  TOP 10
        *
FROM    [20091014_constraint].t_foreign
ORDER BY
        uval, dval

View query details

id	pid	uval	dval	name
1	1	134	134	Foreign 1
2	1	221	221	Foreign 2
3	1	345	345	Foreign 3
4	1	449	449	Foreign 4
5	1	505	505	Foreign 5
6	1	677	677	Foreign 6
7	1	727	727	Foreign 7
8	1	857	857	Foreign 8
9	1	962	962	Foreign 9
10	1	1073	1073	Foreign 10
10 rows fetched in 0.0006s (0.0007s)

Table 't_foreign'. Scan count 1, logical reads 38, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms.

  |--Top(TOP EXPRESSION:((10)))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[20091014_constraint].[t_foreign].[id], [Expr1003]) WITH ORDERED PREFETCH)
            |--Index Scan(OBJECT:([test].[20091014_constraint].[t_foreign].[UQ_foreign_uval]), ORDERED FORWARD)
            |--Clustered Index Seek(OBJECT:([test].[20091014_constraint].[t_foreign].[PK_foreign_id]), SEEK:([test].[20091014_constraint].[t_foreign].[id]=[test].[20091014_constraint].[t_foreign].[id]) LOOKUP ORDERED FORWARD)

Since uval is unique and declared as such, ordering on dval is useless and can be reduced to a mere ORDER BY uval. This is servable by the index on uval (which is implicitly created for any UNIQUE constraint), and yields the execution plan with a Nested Loops to join the index and the table and a Top to return just first 10 values.

This query is within 1 ms, that is instant.

Now, the second query:

SELECT  TOP 10
        *
FROM    [20091014_constraint].t_foreign
ORDER BY
        dval, uval

View query details

id	pid	uval	dval	name
1	1	134	134	Foreign 1
2	1	221	221	Foreign 2
3	1	345	345	Foreign 3
4	1	449	449	Foreign 4
5	1	505	505	Foreign 5
6	1	677	677	Foreign 6
7	1	727	727	Foreign 7
8	1	857	857	Foreign 8
9	1	962	962	Foreign 9
10	1	1073	1073	Foreign 10
10 rows fetched in 0.0006s (0.7031s)

Table 't_foreign'. Scan count 3, logical reads 7744, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 749 ms,  elapsed time = 696 ms.

  |--Top(TOP EXPRESSION:((10)))
       |--Parallelism(Gather Streams, ORDER BY:([test].[20091014_constraint].[t_foreign].[dval] ASC, [test].[20091014_constraint].[t_foreign].[uval] ASC))
            |--Sort(TOP 10, ORDER BY:([test].[20091014_constraint].[t_foreign].[dval] ASC, [test].[20091014_constraint].[t_foreign].[uval] ASC))
                 |--Clustered Index Scan(OBJECT:([test].[20091014_constraint].[t_foreign].[PK_foreign_id]))

Though an index exists on dval, it does not contain enough information to do all the sorting. SQL Server would need to join this index back to the table, fetch the correponsing value of uval and sort on both fields.

This is quite an expensive operation, and SQL Server decided to ignore it. Instead, a full scan on the table itself is performed (to fetch the values of both uval and dval in one pass).

This query works for 0.7 seconds.

Summary

UNIQUE constraints help the optimizer to do the following things:

Ignore DISTINCT on a SELECT clause containing a UNIQUE column
Ignore the values after the UNIQUE column is ORDER BY

This makes the corresponding queries more efficient.

To be continued.

Written by Quassnoi

October 14th, 2009 at 11:00 pm

Posted in SQL Server

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

EXPLAIN EXTENDED