IN with a comma separated list: SQL Server
From Stack Overflow:
How do I search a table for all values from a comma separated list, or return all values if the list contains a single 0?
Using comma separated lists in SQL Server as a search parameter is generally considered a bad design and should be avoided.
One should pass lists in a table variable rather than in a set of comma separated values.
However, there are legacy applications, ORM's and these kinds of stuff that just leave you no choice.
MySQL provides a nice function find_in_set
designed exactly for this task.
SQL Server does not directly support an IN
predicate for CSV
, neither it provides find_in_set
. It needs a rowset for IN
predicate to work.
To expand a comma separated list into a rowset, we will use a recursive CTE
.
In the CTE
, we should search for the position of every next comma and return it.
Then we can just take the substring between two commas and cast it into an INT
:
DECLARE @lid VARCHAR(100) SET @lid = '3, 14, 15, 296, 5358'; WITH cd AS ( SELECT 1 AS first, CHARINDEX(',', @lid, 1) AS next UNION ALL SELECT next + 1, CHARINDEX(',', @lid, next + 1) FROM cd WHERE next > 0 ), lid AS ( SELECT CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first)AS INT) AS id FROM cd ) SELECT * FROM lid
id |
---|
3 |
14 |
15 |
296 |
5358 |
Now we just need to return the values from the table found in this list, or all values if there is a single 0 in the list.
Let's create a sample table and build a query to do this:
CREATE SCHEMA [20090622_csv] CREATE TABLE t_data ( id INT NOT NULL, val INT NOT NULL ) GO CREATE INDEX ix_data_val ON [20090622_csv].t_data(val) BEGIN TRANSACTION SELECT RAND(20090622) DECLARE @cnt INT SET @cnt = 1 WHILE @cnt <= 1000000 BEGIN INSERT INTO [20090622_csv].t_data VALUES (@cnt, FLOOR(RAND() * 200000) + 1) SET @cnt = @cnt + 1 END COMMIT GO [/sourcecode] This table has <strong>1,000,000</strong> rows with column <code>val</code> randomly filled with values from <strong>1</strong> to <strong>200,000</strong>. How do we search it for all values from a comma separated list? We could use the naïve approach, using an <code>OR</code> to combine these two conditions: DECLARE @lid VARCHAR(100) SET @lid = '3, 14, 15926'; WITH cd AS ( SELECT 1 AS first, CHARINDEX(',', @lid, 1) AS next UNION ALL SELECT next + 1, CHARINDEX(',', @lid, next + 1) FROM cd WHERE next > 0 ), lid AS ( SELECT CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first)AS INT) AS id FROM cd ) SELECT * FROM [20090622_csv].t_data WHERE val IN ( SELECT id FROM lid ) OR @lid = '0'
id | val | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
418865 | 14 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
480400 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
484262 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
861197 | 15926 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
246345 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
246484 | 14 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
466169 | 15926 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 rows fetched in 0.0003s (2.7968s) |
Table 'Worktable'. Scan count 6, logical reads 2000052, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_data'. Scan count 3, logical reads 4017, physical reads 0, read-ahead reads 676, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 5422 ms, elapsed time = 2794 ms.
This, however, is vastly inefficient.
For this query, SQL Server builds a plan which scans all table rows, filtering them for this constant value.
The filter, of course, will always return either all rows or no rows at all, and one can tell which on with a single look on the value of @lid, but SQL Server still checks all rows.
To make this query more efficient, we should use an approach similar to one I've used in one of my earlier articles:
We just need to create two mutually exclusive queries, each using the filters combined with an AND operator.
Since SQL Server is much smarter at optimizing AND
's, this should be more efficient.
Here's the query:
DECLARE @lid VARCHAR(100) SET @lid = '3, 14, 15926'; WITH cd AS ( SELECT 1 AS first, CHARINDEX(',', @lid, 1) AS next UNION ALL SELECT next + 1, CHARINDEX(',', @lid, next + 1) FROM cd WHERE next > 0 ), lid AS ( SELECT CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first)AS INT) AS id FROM cd ) SELECT d.* FROM ( SELECT DISTINCT id FROM lid ) l JOIN [20090622_csv].t_data d ON d.val = l.id AND @lid <> '0' UNION ALL SELECT * FROM [20090622_csv].t_data WHERE @lid = '0'
id | val | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
246345 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
480400 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
484262 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
246484 | 14 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
418865 | 14 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
466169 | 15926 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
861197 | 15926 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 rows fetched in 0.0003s (0.0012s) |
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_data'. Scan count 3, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 2, logical reads 20, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.
This is just instant.
Note that I used a JOIN
instead of an IN
predicate for this query.
Though SQL Server usually optimizes an IN
predicate quite well, in this case the subquery would contain a result of a CTE
for which SQL Server just cannot calculate statistics.
It would then revert to a method of choice
for joining two sorted rowsets of unknown size: Nested Loops (Left Semi Join)
with t_data
leading.
This method is quite efficient, but it still requires scanning the whole t_data
which is quite large.
Method of choice for a JOIN
, however, is Nested Loops (Inner Join)
with the subquery leading is this case.
Since the subquery returns but 4 values, they can be looked up instantly using the index on t_data.val
, and the query itself is very fast.
[…] IN with a comma separated list: SQL Server […]
Rokudenashi Blues » Blog Archive » JasperReports
8 Jul 09 at 12:59
CREATE FUNCTION Find_In_Set (@Search_For varChar(8000), @Search_This VarChar(8000))
RETURNS SmallInt AS
BEGIN
DECLARE
@Results VarChar(8000),
@Evaluate VarChar(8000),
@Pointer_Original SmallInt,
@Pointer_Working SmallInt,
@Original_Length SmallInt,
@Comma_Counter SmallInt,
@Current_Character Char(1),
@Returned SmallInt
SET @Pointer_Original = 0
SET @Pointer_Working = 0
SET @Comma_Counter = 1
SET @Returned = 0
SET @Original_Length = Len(@Search_This)
SET @Search_This = RTrim(LTrim(@Search_This))
SET @Results = @Search_This
WHILE @Pointer_Original < @Original_Length
BEGIN
SET @Pointer_Original = @Pointer_Original + 1
SET @Pointer_Working = @Pointer_Working + 1
SET @Current_Character = SubString(@Results,@Pointer_Working,1)
IF @Current_Character = ','
BEGIN
SET @Evaluate = Left(@Results,@Pointer_Working – 1)
IF @Evaluate = @Search_For
BEGIN
SET @Returned = @Comma_Counter
END
ELSE
BEGIN
SET @Results = SubString(@Results,@Pointer_Working+1,8000)
SET @Comma_Counter = @Comma_Counter + 1
SET @Pointer_Working = 0
END
END
END
RETURN @Returned
END
GO
Gregory Hart
4 Jul 15 at 05:22
This is really clever. Great job! Thanks for sharing.
Andrew
23 Oct 20 at 06:40