Efficient EXISTS
Comments enabled. I *really* need your comment
In my previous article:
I compared efficiency of different methods to check for existence of a value in a subquery resultset.
These methods included IN and EXISTS predicates.
It was shown that in SQL Server, a poorly written EXISTS can be very slow, while an IN predicate is almost always efficiently optimized using one of the numerous SEMI JOIN methods.
However, a properly written EXISTS predicate can outperform IN.
Let's create a sample table and see how:
CREATE SCHEMA [20090617_exists]
CREATE TABLE t_outer (
id INT NOT NULL PRIMARY KEY,
val1 INT NOT NULL,
val2 INT NOT NULL
)
CREATE TABLE t_inner (
id INT NOT NULL PRIMARY KEY,
val1 INT NOT NULL,
val2 INT NOT NULL
)
GO
CREATE INDEX ix_outer_val1_val2 ON [20090617_exists].t_outer (val1, val2)
CREATE INDEX ix_inner_val1_val2 ON [20090617_exists].t_inner (val1, val2)
BEGIN TRANSACTION
SELECT RAND(20090617)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 100000
BEGIN
INSERT
INTO [20090617_exists].t_inner
VALUES (@cnt, RAND() * 50000, RAND() * 50000)
INSERT
INTO [20090617_exists].t_outer
VALUES (@cnt, RAND() * 50000, RAND() * 50000)
SET @cnt = @cnt + 1
END
COMMIT
GO
There are two tables: t_outer and t_inner.
val1 and val2 are filled with random values, and there is a composite index on (val1, val2) in both tables.
Now, let's try to find all rows in t_outer for which there are rows in t_inner with both val1 and val2 matching.
Unlike MySQL, SQL Server does not support multiple columns in IN predicate subquery.
However, we can emulate it: we can cast both values into VARCHAR, concatenate them using a separator, and do the same in the subquery.
We thus can make a single column out of two columns, and it can be used in an IN subquery:
SELECT * FROM [20090617_exists].t_outer WHERE CAST(val1 AS VARCHAR) + '|' + CAST(val2 AS VARCHAR) IN ( SELECT CAST(val1 AS VARCHAR) + '|' + CAST(val2 AS VARCHAR) FROM [20090617_exists].t_inner )
| id | val1 | val2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 85652 | 12183 | 34543 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 75842 | 45326 | 6593 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20930 | 45658 | 21280 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 rows fetched in 0.0003s (1.8436s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_outer'. Scan count 1, logical reads 259, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_inner'. Scan count 1, logical reads 262, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1140 ms, elapsed time = 1829 ms.
SELECT
Hash Match (Right Semi Join)
Hash Match (Aggregate)
Compute Scalar
Index Scan (t_inner)
Compute Scalar
Clustered Index Scan (t_outer)
SQL Server optimizer decided to use Hash Match (Right Semi Join) in this case.
It means that the hash table is built on the subquery table (t_inner) which is cleared of the duplicates before that using Hash Match (Aggregate), and then every row from t_outer is probed against the hash table.
This requires building two quite large hash tables, and converting from INT to VARCHAR takes some time, that's why this query is not very efficient: it runs for almost 2 seconds.
In MySQL, the query above could be rewritten as this:
SELECT * FROM t_outer WHERE (val1, val2) IN ( SELECT val1, val2 FROM t_inner )
MySQL's optimizer tends to treat all IN predicates as special cases of an EXISTS predicate.
That's why in MySQL the query above would be internally rewritten as the following query:
SELECT * FROM t_outer o WHERE EXISTS ( SELECT 1 FROM t_inner i WHERE i.val1 = o.val1 AND i.val2 = o.val2 )
Though the former query does not compile in SQL Server, the latter is perfectly legal.
Let's see how it works:
SELECT * FROM [20090617_exists].t_outer i WHERE EXISTS ( SELECT 1 FROM [20090617_exists].t_inner o WHERE o.val1 = i.val1 AND o.val2 = i.val2 )
| id | val1 | val2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 85652 | 12183 | 34543 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 75842 | 45326 | 6593 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20930 | 45658 | 21280 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 rows fetched in 0.0003s (0.0978s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Table 't_outer'. Scan count 1, logical reads 259, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_inner'. Scan count 1, logical reads 264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 94 ms, elapsed time = 98 ms.
SELECT
Merge (Inner Join)
Stream Aggregate
Index Scan (t_inner)
Index Scan (t_outer)
This query uses a Merge Join: the same method that was used in the previous article (IN vs. JOIN vs. EXISTS) for an IN predicate on an indexed column.
This is much more efficient: it takes less than 100 ms to complete.
Despite SQL Server's inability to use a muptiple-column subquery in an IN predicate, the engine sees that this EXISTS predicate would be a synonym for such a query, and uses exactly same plan it would have used for a single-column IN predicate.
Summary:
EXISTS predicate is much more flexible than IN predicate, especially in SQL Server where the latter one has very limited support.
While a complex, non-sargable and non-hashable EXISTS predicate can perform very poorly and a plan can be built for it than would reevaluate the subquery on each loop, an EXISTS predicate can also be very efficient.
For instance, EXISTS equivalent of a multiple column IN predicate:
(val1, val2) IN (SELECT val1, val1 FROM t_inner)
is recognized as such by SQL Server's optimizer and an efficient Merge Join method is proposed for this, despite the fact that SQL Server does not support multiple-column IN predicates directly.
Subscribe in a reader