EXPLAIN EXTENDED

How to create fast database queries

Efficient EXISTS

Comments enabled. I *really* need your comment

In my previous article:

I compared efficiency of different methods to check for existence of a value in a subquery resultset.

These methods included IN and EXISTS predicates.

It was shown that in SQL Server, a poorly written EXISTS can be very slow, while an IN predicate is almost always efficiently optimized using one of the numerous SEMI JOIN methods.

However, a properly written EXISTS predicate can outperform IN.

Let's create a sample table and see how:

CREATE SCHEMA [20090617_exists]
CREATE TABLE t_outer (
id INT NOT NULL PRIMARY KEY,
val1 INT NOT NULL,
val2 INT NOT NULL
)
CREATE TABLE t_inner (
id INT NOT NULL PRIMARY KEY,
val1 INT NOT NULL,
val2 INT NOT NULL
)
GO
CREATE INDEX ix_outer_val1_val2 ON [20090617_exists].t_outer (val1, val2)
CREATE INDEX ix_inner_val1_val2 ON [20090617_exists].t_inner (val1, val2)
BEGIN TRANSACTION
SELECT  RAND(20090617)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 100000
BEGIN
        INSERT
        INTO    [20090617_exists].t_inner
        VALUES  (@cnt, RAND() * 50000, RAND() * 50000)
        INSERT
        INTO    [20090617_exists].t_outer
        VALUES  (@cnt, RAND() * 50000, RAND() * 50000)
        SET @cnt = @cnt + 1
END

COMMIT
GO

There are two tables: t_outer and t_inner.

val1 and val2 are filled with random values, and there is a composite index on (val1, val2) in both tables.

Now, let's try to find all rows in t_outer for which there are rows in t_inner with both val1 and val2 matching.

Unlike MySQL, SQL Server does not support multiple columns in IN predicate subquery.

However, we can emulate it: we can cast both values into VARCHAR, concatenate them using a separator, and do the same in the subquery.

We thus can make a single column out of two columns, and it can be used in an IN subquery:

SELECT  *
FROM    [20090617_exists].t_outer
WHERE   CAST(val1 AS VARCHAR) + '|' + CAST(val2 AS VARCHAR) IN
(
SELECT  CAST(val1 AS VARCHAR) + '|' + CAST(val2 AS VARCHAR)
FROM    [20090617_exists].t_inner
)
id val1 val2
85652 12183 34543
75842 45326 6593
20930 45658 21280
3 rows fetched in 0.0003s (1.8436s)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_outer'. Scan count 1, logical reads 259, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_inner'. Scan count 1, logical reads 262, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1140 ms,  elapsed time = 1829 ms. 
SELECT
  Hash Match (Right Semi Join)
    Hash Match (Aggregate)
      Compute Scalar
        Index Scan (t_inner)
    Compute Scalar
      Clustered Index Scan (t_outer)

SQL Server optimizer decided to use Hash Match (Right Semi Join) in this case.

It means that the hash table is built on the subquery table (t_inner) which is cleared of the duplicates before that using Hash Match (Aggregate), and then every row from t_outer is probed against the hash table.

This requires building two quite large hash tables, and converting from INT to VARCHAR takes some time, that's why this query is not very efficient: it runs for almost 2 seconds.

In MySQL, the query above could be rewritten as this:

SELECT  *
FROM    t_outer
WHERE   (val1, val2) IN
(
SELECT  val1, val2
FROM    t_inner
)

MySQL's optimizer tends to treat all IN predicates as special cases of an EXISTS predicate.

That's why in MySQL the query above would be internally rewritten as the following query:

SELECT  *
FROM    t_outer o
WHERE   EXISTS
(
SELECT  1
FROM    t_inner i
WHERE i.val1 = o.val1
AND i.val2 = o.val2
)

Though the former query does not compile in SQL Server, the latter is perfectly legal.

Let's see how it works:

SELECT  *
FROM    [20090617_exists].t_outer i
WHERE   EXISTS
(
SELECT  1
FROM    [20090617_exists].t_inner o
WHERE   o.val1 = i.val1
AND o.val2 = i.val2
)

id val1 val2
85652 12183 34543
75842 45326 6593
20930 45658 21280
3 rows fetched in 0.0003s (0.0978s)
Table 't_outer'. Scan count 1, logical reads 259, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_inner'. Scan count 1, logical reads 264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 94 ms,  elapsed time = 98 ms. 
SELECT
  Merge (Inner Join)
    Stream Aggregate
      Index Scan (t_inner)
    Index Scan (t_outer)

This query uses a Merge Join: the same method that was used in the previous article (IN vs. JOIN vs. EXISTS) for an IN predicate on an indexed column.

This is much more efficient: it takes less than 100 ms to complete.

Despite SQL Server's inability to use a muptiple-column subquery in an IN predicate, the engine sees that this EXISTS predicate would be a synonym for such a query, and uses exactly same plan it would have used for a single-column IN predicate.

Summary:

EXISTS predicate is much more flexible than IN predicate, especially in SQL Server where the latter one has very limited support.

While a complex, non-sargable and non-hashable EXISTS predicate can perform very poorly and a plan can be built for it than would reevaluate the subquery on each loop, an EXISTS predicate can also be very efficient.

For instance, EXISTS equivalent of a multiple column IN predicate:

(val1, val2) IN (SELECT val1, val1 FROM t_inner)

is recognized as such by SQL Server's optimizer and an efficient Merge Join method is proposed for this, despite the fact that SQL Server does not support multiple-column IN predicates directly.

Written by Quassnoi

June 17th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply