Selecting compatible articles
Comments enabled. I *really* need your comment
From Stack Overflow:
I need to formulate an SQL query that returns all articles that are compatible to a set of other articles (of arbitrary size).
So for a list of article numbers
A, B,… , N
the question is:Give me all articles that are compatible with
A and B and … and N
For example, consider the table
A B 1 2 3 1 3 4 If I wanted all articles that are compatible with 1, the query would return (2, 3).
The query generated by the list (2, 3) will return 1, whilst the query generated from the list (1, 3) generates an empty list.
This table describes a friendship: a symmetric irreflexive binary relation.
That is:
- For any given
a
,b
, ifa
is a friend tob
, thenb
is a friend toa
- For any given
a
,a
is never a friend to itself
This relation is heavily used by social networks.
A normalized table describing this relation should be defined like this:
CREATE TABLE t_set ( a INT NOT NULL, b INT NOT NULL ) ALTER TABLE t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b) ALTER TABLE t_set ADD CONSTRAINT ck_set_ab CHECK (a < b) [/sourcecode] , the check being added to account for the relation symmetry. Complete relation can be retrieved with the following query: <span id="more-2050"></span> SELECT a AS id, b AS friend FROM t_set UNION ALL SELECT b, a FROM t_set
, which returns all possible combinations for which the friend is a friend of the id
In some systems, namely MySQL, keeping the relation in its denormalized form can yield better performance, as I have shown in one of my previous articles:
In this article, however, I would like to consider two different approaches that can solve the task described above in SQL Server.
To find all articles that are compatible with every item from a given list, we can implement two approaches.
- A pure set-based approach: return each compatibility record for each item from the list, group by articles and count. Articles that are compatible with every item in the list will be returned as many times as many items there are in the list.
- A recursive approach: find all compatible articles for the first item in the list, and recursively iterate the list, checking that the article that is compatible with the first item, is also compatible with the second one, third one and so on. An article that made it to the last step in recursion is compatible with every item in the list.
Let's create a sample table and see which approach is more efficient:
CREATE SCHEMA [20090710_compatible] CREATE TABLE t_set ( a INT NOT NULL, b INT NOT NULL ) GO ALTER TABLE [20090710_compatible].t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b) ALTER TABLE [20090710_compatible].t_set ADD CONSTRAINT ck_set_ab CHECK (a < b) CREATE INDEX ix_set_b ON [20090710_compatible].t_set (b) GO BEGIN TRANSACTION SELECT RAND(20090710) DECLARE @cnt INT DECLARE @av INT DECLARE @bv INT DECLARE @a INT DECLARE @b INT SET @cnt = 1 WHILE @cnt <= 1000000 BEGIN SET @av = FLOOR(RAND() * 999) + 1 SET @bv = FLOOR(RAND() * 999) + 1 SET @a = CASE WHEN @av < @bv THEN @av ELSE @bv END SET @b = CASE WHEN @av < @bv THEN @bv ELSE @av END IF NOT EXISTS ( SELECT 1 FROM [20090710_compatible].t_set WHERE a = @a AND b = @b ) AND @a <> @b BEGIN INSERT INTO [20090710_compatible].t_set (a, b) VALUES (@a, @b) END SET @cnt = @cnt + 1 END COMMIT
This normalized table has 431.197 records which describe random compatibilities between articles with id
s from 1 to 1,000.
To keep the items for the articles to be compatible with we will use a temporary table instead of a constant list. We will also need the total number of items in the list and this design will help us to find it more easily.
We will first fill the list with the sample set of items (i. e. all id
's from 1 to 50), then calculate the list size and cache it in the variable.
Then will will find articles compatible with any item in the list. The will return us all compatible articles for each item, some of articles being compatible with more than of item and therefore being returned more than once.
If there are 50 items in the list and an article is returned 50 times, it means that it is compatible with every item in the list and therefore satisifies the conditions for our task. We just need to aggregate the id
's, take COUNT(*)
and compare it to the count of items calculated and cached earlier.
Compatibility table always stores the article with the least id
in column A
, and the article with the greatest id
in column B
.
That's why we will need to make a UNION ALL
of two queries to retrieve all records, first query searching for the item in column A
and returning B
as an id
of the compatible article, second one doing the opposite.
Here's the query:
DECLARE @list TABLE ( id INT PRIMARY KEY ) ;WITH sets AS ( SELECT 1 AS cnt UNION ALL SELECT cnt + 1 FROM sets WHERE cnt < 50 ) INSERT INTO @list SELECT cnt FROM sets DECLARE @lcnt INT SELECT @lcnt = COUNT(*) FROM @list SELECT id, COUNT(*) FROM ( SELECT a AS id FROM [20090710_compatible].t_set WHERE b IN ( SELECT id FROM @list ) UNION ALL SELECT b FROM [20090710_compatible].t_set WHERE a IN ( SELECT id FROM @list ) ) q GROUP BY id HAVING COUNT(*) = @lcnt ORDER BY id [/sourcecode] <div class="terminal"> <table class="terminal"> <tr> <th>id</th><th></th></tr> <tr><td class="int">566</td><td class="int">50</td></tr> <tr class="statusbar"><td colspan="100">1 row fetched in 0.0002s (0.0627s)</td></tr> </table> </div> <pre> Table 't_set'. Scan count 100, logical reads 573, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#58F2C25C'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 47 ms, elapsed time = 59 ms. </pre> This query runs for <strong>59 ms</strong>. Now, let's see how a recursive approach works. First, we enumerate all items in the list using <strong>ROW_NUMBER()</strong> and wrap the results into a subquery. In the initial part of a recursive <strong>CTE</strong>, we will search for all articles that are compatible with the first item in the list. This is also done using <strong>UNION ALL</strong>. In the recursive part of the <strong>CTE</strong>, we will take each articles we retrieved in the initial part and compare it against the second item of the list, third item etc. The number of item we are comparing to is defined by the number of the recursion step that we will calculate along with the other values. An item that fails the check will break the recursion. That means that we should return only those items that made it to the last possible recursion step with the number equal to the number of items in the list. Here's the recursive query: SET STATISTICS TIME ON SET STATISTICS IO ON DECLARE @list TABLE ( id INT PRIMARY KEY ) ;WITH sets AS ( SELECT 1 AS cnt UNION ALL SELECT cnt + 1 FROM sets WHERE cnt < 50 ) INSERT INTO @list SELECT cnt FROM sets DECLARE @lcnt INT SELECT @lcnt = COUNT(*) FROM @list ;WITH l AS ( SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM @list ), q AS ( SELECT l.id AS lid, s.b AS sid, 1 AS rn FROM l JOIN [20090710_compatible].t_set s ON s.a = l.id WHERE l.rn = 1 UNION ALL SELECT l.id, s.a, 1 FROM l JOIN [20090710_compatible].t_set s ON s.b = l.id WHERE l.rn = 1 UNION ALL SELECT l.id, q.sid, q.rn + 1 FROM q JOIN l ON l.rn = q.rn + 1 JOIN [20090710_compatible].t_set s ON s.a = CASE WHEN q.sid < l.id THEN q.sid ELSE l.id END AND s.b = CASE WHEN q.sid > l.id THEN q.sid ELSE l.id END ) SELECT * FROM q WHERE rn = ( SELECT COUNT(*) FROM @list ) ORDER BY sid
lid | sid | rn | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50 | 566 | 50 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.2081s) |
Table 'Worktable'. Scan count 2, logical reads 37279, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_set'. Scan count 6493, logical reads 19487, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#76832543'. Scan count 6495, logical reads 12990, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 203 ms, elapsed time = 204 ms.
We see that this query is far less efficient than a GROUP BY
one.
There are two reasons for this:
- The recursive query is in fact a
NESTED LOOP
while the aggregating query uses more efficientHASH JOIN
that performs better when you have to join whole resultsets - The recursive query cannot be efficiently parallelized
However, the recursive query has one advantage: it performs less steps and filters out non-matching articles quite early, while the aggregating query always aggregates all articles and the filters out the non-matching ones.
Let's change the initial condition a little, and fill the list with all items article 1 is compatible with. This will give us a list of 860 items.
Here's the aggregating query:
SET STATISTICS TIME ON SET STATISTICS IO ON DECLARE @list TABLE ( id INT PRIMARY KEY ) INSERT INTO @list SELECT b FROM [20090710_compatible].t_set WHERE a = 1 DECLARE @lcnt INT SELECT @lcnt = COUNT(*) FROM @list SELECT id, COUNT(*) FROM ( SELECT a AS id FROM [20090710_compatible].t_set WHERE b IN ( SELECT id FROM @list ) UNION ALL SELECT b FROM [20090710_compatible].t_set WHERE a IN ( SELECT id FROM @list ) ) q GROUP BY id HAVING COUNT(*) = @lcnt ORDER BY id
id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 860 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (2.3281s) |
Table 't_set'. Scan count 1720, logical reads 9971, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#04D1449A'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1078 ms, elapsed time = 2310 ms.
We see that it runs for more than 2.3 seconds, since it needs to aggregate way too many values: about 750 items for any given article.
And here is the recursive query:
SET STATISTICS TIME ON SET STATISTICS IO ON DECLARE @list TABLE ( id INT PRIMARY KEY ) INSERT INTO @list SELECT b FROM [20090710_compatible].t_set WHERE a = 1 DECLARE @lcnt INT SELECT @lcnt = COUNT(*) FROM @list ;WITH l AS ( SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM @list ), q AS ( SELECT l.id AS lid, s.b AS sid, 1 AS rn FROM l JOIN [20090710_compatible].t_set s ON s.a = l.id WHERE l.rn = 1 UNION ALL SELECT l.id, s.a, 1 FROM l JOIN [20090710_compatible].t_set s ON s.b = l.id WHERE l.rn = 1 UNION ALL SELECT l.id, q.sid, q.rn + 1 FROM q JOIN l ON l.rn = q.rn + 1 JOIN [20090710_compatible].t_set s ON s.a = CASE WHEN q.sid < l.id THEN q.sid ELSE l.id END AND s.b = CASE WHEN q.sid > l.id THEN q.sid ELSE l.id END ) SELECT * FROM q WHERE rn = ( SELECT COUNT(*) FROM @list ) ORDER BY sid OPTION (MAXRECURSION 0)
lid | sid | rn | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999 | 1 | 860 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (0.4138s) |
Table 'Worktable'. Scan count 2, logical reads 43728, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 't_set'. Scan count 7572, logical reads 22726, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#1E91169D'. Scan count 7574, logical reads 15390, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 406 ms, elapsed time = 408 ms.
Only 0.4 seconds, much faster.
This is because on each step it deals only with articles that were not yet filtered out.
An article gets filtered out quite quickly, that's why this solutions is better for the big item lists.
Summary:
To find all articles that are compatible to all items from a given list, we need to find out how many items does it satisfy.
There are two ways to do this in SQL Server:
- Retrieve all articles compatible with all items, count them and return those articles having
COUNT(*)
equal to the number of items in the list.
This colution is more parallelization friendly and can employ more efficientJOIN
algorithms, that's why it's better for checking against small item lists. - Retrieve all articles compatible with the first item from the list and recursively check them against all other items in the list. Whenever a check is failed, the recursion is broken. We need to record the recursion step number on each step and return only those articles that made it to the last recursion step, that is the the step with same number as the number of items in the list.
This solution performs less comparisons and therefore is more efficient for checking against large item lists, where the reduced number of comparison overweights the benefits of parallelization andHASH JOIN
s.