EXPLAIN EXTENDED

How to create fast database queries

IN with a comma separated list: SQL Server

with 3 comments

From Stack Overflow:

How do I search a table for all values from a comma separated list, or return all values if the list contains a single 0?

Using comma separated lists in SQL Server as a search parameter is generally considered a bad design and should be avoided.

One should pass lists in a table variable rather than in a set of comma separated values.
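
A minimal sketch of the table variable approach (reusing the t_data table created later in this article purely for illustration) could look like this:

-- the list is passed as rows, not as a string
DECLARE @ids TABLE (id INT PRIMARY KEY)

INSERT
INTO    @ids (id)
SELECT  3 UNION ALL
SELECT  14 UNION ALL
SELECT  15926

-- a plain join lets the optimizer use the index on val
SELECT  d.*
FROM    @ids i
JOIN    [20090622_csv].t_data d
ON      d.val = i.id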

However, there are legacy applications, ORMs and the like that just leave you no choice.

MySQL provides a nice function find_in_set designed exactly for this task.
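
For comparison, a MySQL query using it could look roughly like this (the t_data table from the example below is reused purely for illustration; note that FIND_IN_SET expects the list to contain no spaces after the commas):

-- FIND_IN_SET returns the 1-based position of val in the list, or 0 if it is not there
SELECT  *
FROM    t_data
WHERE   FIND_IN_SET(val, '3,14,15926');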

SQL Server does not directly support an IN predicate against a CSV string, nor does it provide a find_in_set equivalent. It needs a rowset for the IN predicate to work.

To expand a comma separated list into a rowset, we will use a recursive CTE.

In the CTE, we search for the position of each successive comma and return it.

Then we can just take the substring between two adjacent commas and cast it to an INT:

DECLARE @lid VARCHAR(100)
SET @lid = '3, 14, 15, 296, 5358';
WITH    cd AS
        (
        SELECT  1 AS first, CHARINDEX(',', @lid, 1) AS next
        UNION ALL
        SELECT  next + 1, CHARINDEX(',', @lid, next + 1)
        FROM    cd
        WHERE   next > 0
        ),
        lid AS
        (
        SELECT  CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first) AS INT) AS id
        FROM    cd
        )
SELECT  *
FROM    lid
id
3
14
15
296
5358

Now we just need to return the values from the table found in this list, or all values if there is a single 0 in the list.

Let's create a sample table and build a query to do this:

CREATE SCHEMA [20090622_csv]
CREATE TABLE t_data (
        id INT NOT NULL,
        val INT NOT NULL
)
GO
CREATE INDEX ix_data_val ON [20090622_csv].t_data(val)
BEGIN TRANSACTION
SELECT  RAND(20090622)
DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        INSERT
        INTO    [20090622_csv].t_data
        VALUES  (@cnt, FLOOR(RAND() * 200000) + 1)
        SET @cnt = @cnt + 1
END
COMMIT
GO

This table has 1,000,000 rows with column val randomly filled with values from 1 to 200,000.

How do we search it for all values from a comma separated list?

We could use the naïve approach, using an OR to combine these two conditions:


DECLARE @lid VARCHAR(100)
SET @lid = '3, 14, 15926';
WITH    cd AS
        (
        SELECT  1 AS first, CHARINDEX(',', @lid, 1) AS next
        UNION ALL
        SELECT  next + 1, CHARINDEX(',', @lid, next + 1)
        FROM    cd
        WHERE   next > 0
        ),
        lid AS
        (
        SELECT  CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first) AS INT) AS id
        FROM    cd
        )
SELECT  *
FROM    [20090622_csv].t_data
WHERE   val IN
        (
        SELECT  id
        FROM    lid
        )
        OR @lid = '0'

id val
418865 14
480400 3
484262 3
861197 15926
246345 3
246484 14
466169 15926
7 rows fetched in 0.0003s (2.7968s)
Table 'Worktable'. Scan count 6, logical reads 2000052, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_data'. Scan count 3, logical reads 4017, physical reads 0, read-ahead reads 676, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 5422 ms,  elapsed time = 2794 ms. 
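
Figures of this shape are what the standard session statistics produce; they can be enabled before running the query like this:

-- report logical / physical reads and CPU / elapsed time for each statement
SET STATISTICS IO ON
SET STATISTICS TIME ON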

This, however, is vastly inefficient.

For this query, SQL Server builds a plan which scans all table rows, filtering them for this constant value.

The filter, of course, will always return either all rows or no rows at all, and one can tell which with a single look at the value of @lid, but SQL Server still checks every row.

To make this query more efficient, we should use an approach similar to the one I used in one of my earlier articles:

We just need to create two mutually exclusive queries, each combining its filters with an AND operator.

Since SQL Server is much better at optimizing ANDs, this should be more efficient.

Here's the query:

DECLARE @lid VARCHAR(100)
SET @lid = '3, 14, 15926';
WITH    cd AS
        (
        SELECT  1 AS first, CHARINDEX(',', @lid, 1) AS next
        UNION ALL
        SELECT  next + 1, CHARINDEX(',', @lid, next + 1)
        FROM    cd
        WHERE   next > 0
        ),
        lid AS
        (
        SELECT  CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first) AS INT) AS id
        FROM    cd
        )
SELECT  d.*
FROM    (
        SELECT  DISTINCT id
        FROM    lid
        ) l
JOIN    [20090622_csv].t_data d
ON      d.val = l.id
        AND @lid <> '0'
UNION ALL
SELECT  *
FROM    [20090622_csv].t_data
WHERE   @lid = '0'

id val
246345 3
480400 3
484262 3
246484 14
418865 14
466169 15926
861197 15926
7 rows fetched in 0.0003s (0.0012s)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_data'. Scan count 3, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 2, logical reads 20, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 

This is just instant.

Note that I used a JOIN instead of an IN predicate for this query.

Though SQL Server usually optimizes an IN predicate quite well, in this case the subquery would contain the result of a recursive CTE, for which SQL Server just cannot calculate statistics.

It would then revert to its method of choice for joining two sorted rowsets of unknown size: Nested Loops (Left Semi Join) with t_data leading.

This method is quite efficient, but it still requires scanning the whole of t_data, which is quite large.

The method of choice for a JOIN, however, is Nested Loops (Inner Join) with the subquery leading in this case.

Since the subquery returns just 3 values, they can be looked up almost instantly using the index on t_data.val, and the query itself is very fast.
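
For comparison, the IN form of the filtering branch (keeping the same cd and lid CTEs as above) would look roughly like this; it is only a sketch to show the shape that leads to the semi join plan described above:

SELECT  *
FROM    [20090622_csv].t_data
WHERE   val IN
        (
        SELECT  id
        FROM    lid
        )
        AND @lid <> '0'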

Written by Quassnoi

June 23rd, 2009 at 11:00 pm

Posted in SQL Server

3 Responses to 'IN with a comma separated list: SQL Server'

  1. […] IN with a comma separated list: SQL Server […]

  2. CREATE FUNCTION Find_In_Set (@Search_For varChar(8000), @Search_This VarChar(8000))
    RETURNS SmallInt AS
    BEGIN
    DECLARE
    @Results VarChar(8000),
    @Evaluate VarChar(8000),
    @Pointer_Original SmallInt,
    @Pointer_Working SmallInt,
    @Original_Length SmallInt,
    @Comma_Counter SmallInt,
    @Current_Character Char(1),
    @Returned SmallInt
    SET @Pointer_Original = 0
    SET @Pointer_Working = 0
    SET @Comma_Counter = 1
    SET @Returned = 0
    SET @Original_Length = Len(@Search_This)
    SET @Search_This = RTrim(LTrim(@Search_This))
    SET @Results = @Search_This
    WHILE @Pointer_Original < @Original_Length
    BEGIN
    SET @Pointer_Original = @Pointer_Original + 1
    SET @Pointer_Working = @Pointer_Working + 1
    SET @Current_Character = SubString(@Results,@Pointer_Working,1)
    IF @Current_Character = ','
    BEGIN
    SET @Evaluate = Left(@Results,@Pointer_Working - 1)
    IF @Evaluate = @Search_For
    BEGIN
    SET @Returned = @Comma_Counter
    END
    ELSE
    BEGIN
    SET @Results = SubString(@Results,@Pointer_Working+1,8000)
    SET @Comma_Counter = @Comma_Counter + 1
    SET @Pointer_Working = 0
    END
    END
    END
    RETURN @Returned
    END
    GO

    Gregory Hart

    4 Jul 15 at 05:22

  3. This is really clever. Great job! Thanks for sharing.

    Andrew

    23 Oct 20 at 06:40
