Efficient EXISTS
Comments enabled. I *really* need your comment
In my previous article:
I compared efficiency of different methods to check for existence of a value in a subquery resultset.
These methods included IN
and EXISTS
predicates.
It was shown that in SQL Server, a poorly written EXISTS
can be very slow, while an IN
predicate is almost always efficiently optimized using one of the numerous SEMI JOIN
methods.
However, a properly written EXISTS
predicate can outperform IN
.
Let's create a sample table and see how:
CREATE SCHEMA [20090617_exists] CREATE TABLE t_outer ( id INT NOT NULL PRIMARY KEY, val1 INT NOT NULL, val2 INT NOT NULL ) CREATE TABLE t_inner ( id INT NOT NULL PRIMARY KEY, val1 INT NOT NULL, val2 INT NOT NULL ) GO CREATE INDEX ix_outer_val1_val2 ON [20090617_exists].t_outer (val1, val2) CREATE INDEX ix_inner_val1_val2 ON [20090617_exists].t_inner (val1, val2) BEGIN TRANSACTION SELECT RAND(20090617) DECLARE @cnt INT SET @cnt = 1 WHILE @cnt <= 100000 BEGIN INSERT INTO [20090617_exists].t_inner VALUES (@cnt, RAND() * 50000, RAND() * 50000) INSERT INTO [20090617_exists].t_outer VALUES (@cnt, RAND() * 50000, RAND() * 50000) SET @cnt = @cnt + 1 END COMMIT GO
There are two tables: t_outer
and t_inner
.
val1
and val2
are filled with random values, and there is a composite index on (val1, val2)
in both tables.
Now, let's try to find all rows in t_outer
for which there are rows in t_inner
with both val1
and val2
matching.
Unlike MySQL, SQL Server does not support multiple columns in IN
predicate subquery.
However, we can emulate it: we can cast both values into VARCHAR
, concatenate them using a separator, and do the same in the subquery.
We thus can make a single column out of two columns, and it can be used in an IN
subquery:
SELECT * FROM [20090617_exists].t_outer WHERE CAST(val1 AS VARCHAR) + '|' + CAST(val2 AS VARCHAR) IN ( SELECT CAST(val1 AS VARCHAR) + '|' + CAST(val2 AS VARCHAR) FROM [20090617_exists].t_inner )
id | val1 | val2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
85652 | 12183 | 34543 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
75842 | 45326 | 6593 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20930 | 45658 | 21280 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0003s (1.8436s) |
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_outer'. Scan count 1, logical reads 259, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_inner'. Scan count 1, logical reads 262, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1140 ms, elapsed time = 1829 ms.
SELECT Hash Match (Right Semi Join) Hash Match (Aggregate) Compute Scalar Index Scan (t_inner) Compute Scalar Clustered Index Scan (t_outer)
SQL Server optimizer decided to use Hash Match (Right Semi Join)
in this case.
It means that the hash table is built on the subquery table (t_inner
) which is cleared of the duplicates before that using Hash Match (Aggregate)
, and then every row from t_outer
is probed against the hash table.
This requires building two quite large hash tables, and converting from INT
to VARCHAR
takes some time, that's why this query is not very efficient: it runs for almost 2 seconds.
In MySQL, the query above could be rewritten as this:
SELECT * FROM t_outer WHERE (val1, val2) IN ( SELECT val1, val2 FROM t_inner )
MySQL's optimizer tends to treat all IN
predicates as special cases of an EXISTS
predicate.
That's why in MySQL the query above would be internally rewritten as the following query:
SELECT * FROM t_outer o WHERE EXISTS ( SELECT 1 FROM t_inner i WHERE i.val1 = o.val1 AND i.val2 = o.val2 )
Though the former query does not compile in SQL Server, the latter is perfectly legal.
Let's see how it works:
SELECT * FROM [20090617_exists].t_outer i WHERE EXISTS ( SELECT 1 FROM [20090617_exists].t_inner o WHERE o.val1 = i.val1 AND o.val2 = i.val2 )
id | val1 | val2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
85652 | 12183 | 34543 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
75842 | 45326 | 6593 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20930 | 45658 | 21280 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0003s (0.0978s) |
Table 't_outer'. Scan count 1, logical reads 259, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_inner'. Scan count 1, logical reads 264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 94 ms, elapsed time = 98 ms.
SELECT Merge (Inner Join) Stream Aggregate Index Scan (t_inner) Index Scan (t_outer)
This query uses a Merge Join
: the same method that was used in the previous article (IN vs. JOIN vs. EXISTS) for an IN
predicate on an indexed column.
This is much more efficient: it takes less than 100 ms to complete.
Despite SQL Server's inability to use a muptiple-column subquery in an IN
predicate, the engine sees that this EXISTS
predicate would be a synonym for such a query, and uses exactly same plan it would have used for a single-column IN
predicate.
Summary:
EXISTS
predicate is much more flexible than IN
predicate, especially in SQL Server where the latter one has very limited support.
While a complex, non-sargable and non-hashable EXISTS
predicate can perform very poorly and a plan can be built for it than would reevaluate the subquery on each loop, an EXISTS
predicate can also be very efficient.
For instance, EXISTS
equivalent of a multiple column IN
predicate:
(val1, val2) IN (SELECT val1, val1 FROM t_inner)
is recognized as such by SQL Server's optimizer and an efficient Merge Join
method is proposed for this, despite the fact that SQL Server does not support multiple-column IN
predicates directly.