EXPLAIN EXTENDED

How to create fast database queries

Selecting compatible articles

Comments enabled. I *really* need your comment

From Stack Overflow:

I need to formulate an SQL query that returns all articles that are compatible to a set of other articles (of arbitrary size).

So for a list of article numbers A, B,… , N the question is:

Give me all articles that are compatible with A and B and … and N

For example, consider the table

A B
1 2
3 1
3 4

If I wanted all articles that are compatible with 1, the query would return (2, 3).

The query generated by the list (2, 3) will return 1, whilst the query generated from the list (1, 3) generates an empty list.

This table describes a friendship: a symmetric irreflexive binary relation.

That is:

  • For any given a, b, if a is a friend to b, then b is a friend to a
  • For any given a, a is never a friend to itself

This relation is heavily used by social networks.

A normalized table describing this relation should be defined like this:

CREATE TABLE t_set (
        a INT NOT NULL,
        b INT NOT NULL
)
ALTER TABLE t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b)
ALTER TABLE t_set ADD CONSTRAINT ck_set_ab CHECK (a < b)
&#91;/sourcecode&#93;

, the check being added to account for the relation symmetry.

Complete relation can be retrieved with the following query:
<span id="more-2050"></span>

SELECT  a AS id, b AS friend
FROM     t_set
UNION ALL
SELECT  b, a
FROM    t_set

, which returns all possible combinations for which the friend is a friend of the id

In some systems, namely MySQL, keeping the relation in its denormalized form can yield better performance, as I have shown in one of my previous articles:

In this article, however, I would like to consider two different approaches that can solve the task described above in SQL Server.

To find all articles that are compatible with every item from a given list, we can implement two approaches.

  • A pure set-based approach: return each compatibility record for each item from the list, group by articles and count. Articles that are compatible with every item in the list will be returned as many times as many items there are in the list.
  • A recursive approach: find all compatible articles for the first item in the list, and recursively iterate the list, checking that the article that is compatible with the first item, is also compatible with the second one, third one and so on. An article that made it to the last step in recursion is compatible with every item in the list.

Let's create a sample table and see which approach is more efficient:

CREATE SCHEMA [20090710_compatible]
CREATE TABLE t_set (
        a INT NOT NULL,
        b INT NOT NULL
)
GO
ALTER TABLE [20090710_compatible].t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b)
ALTER TABLE [20090710_compatible].t_set ADD CONSTRAINT ck_set_ab CHECK (a < b)
CREATE INDEX ix_set_b ON &#91;20090710_compatible&#93;.t_set (b)
GO
BEGIN TRANSACTION
SELECT  RAND(20090710)
DECLARE @cnt INT
DECLARE @av INT
DECLARE @bv INT
DECLARE @a INT
DECLARE @b INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        SET @av = FLOOR(RAND() * 999) + 1 
        SET @bv = FLOOR(RAND() * 999) + 1
        SET @a = CASE WHEN @av < @bv THEN @av ELSE @bv END
        SET @b = CASE WHEN @av < @bv THEN @bv ELSE @av END
        IF NOT EXISTS
        (
        SELECT  1
        FROM    &#91;20090710_compatible&#93;.t_set
        WHERE   a = @a
                AND b = @b
        ) AND @a <> @b
        BEGIN
                INSERT
                INTO    [20090710_compatible].t_set (a, b)
                VALUES  (@a, @b)
        END
        SET @cnt = @cnt + 1
END
COMMIT

This normalized table has 431.197 records which describe random compatibilities between articles with ids from 1 to 1,000.

To keep the items for the articles to be compatible with we will use a temporary table instead of a constant list. We will also need the total number of items in the list and this design will help us to find it more easily.

We will first fill the list with the sample set of items (i. e. all id's from 1 to 50), then calculate the list size and cache it in the variable.

Then will will find articles compatible with any item in the list. The will return us all compatible articles for each item, some of articles being compatible with more than of item and therefore being returned more than once.

If there are 50 items in the list and an article is returned 50 times, it means that it is compatible with every item in the list and therefore satisifies the conditions for our task. We just need to aggregate the id's, take COUNT(*) and compare it to the count of items calculated and cached earlier.

Compatibility table always stores the article with the least id in column A, and the article with the greatest id in column B.

That's why we will need to make a UNION ALL of two queries to retrieve all records, first query searching for the item in column A and returning B as an id of the compatible article, second one doing the opposite.

Here's the query:

DECLARE @list TABLE
        (
        id INT PRIMARY KEY
        )
;WITH   sets AS
        (
        SELECT  1 AS cnt
        UNION ALL
        SELECT  cnt + 1
        FROM    sets
        WHERE   cnt < 50
        )
INSERT
INTO    @list
SELECT  cnt
FROM    sets

DECLARE @lcnt INT
SELECT  @lcnt = COUNT(*)
FROM    @list

SELECT  id, COUNT(*)
FROM    (
        SELECT  a AS id
        FROM    &#91;20090710_compatible&#93;.t_set
        WHERE   b IN
                (
                SELECT  id
                FROM    @list
                )
        UNION ALL
        SELECT  b
        FROM    &#91;20090710_compatible&#93;.t_set
        WHERE   a IN
                (
                SELECT  id
                FROM    @list
                )
        ) q
GROUP BY
        id
HAVING  COUNT(*) = @lcnt
ORDER BY
        id

&#91;/sourcecode&#93;

<div class="terminal">
<table class="terminal">
<tr>
<th>id</th><th></th></tr>
<tr><td class="int">566</td><td class="int">50</td></tr>
<tr class="statusbar"><td colspan="100">1 row fetched in 0.0002s (0.0627s)</td></tr>
</table>
</div>

<pre>
Table 't_set'. Scan count 100, logical reads 573, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table '#58F2C25C'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 47 ms,  elapsed time = 59 ms. 
</pre>

This query runs for <strong>59 ms</strong>.

Now, let's see how a recursive approach works.

First, we enumerate all items in the list using <strong>ROW_NUMBER()</strong> and wrap the results into a subquery.

In the initial part of a recursive <strong>CTE</strong>, we will search for all articles that are compatible with the first item in the list. This is also done using <strong>UNION ALL</strong>.

In the recursive part of the <strong>CTE</strong>, we will take each articles we retrieved in the initial part and compare it against the second item of the list, third item etc. The number of item we are comparing to is defined by the number of the recursion step that we will calculate along with the other values.

An item that fails the check will break the recursion. That means that we should return only those items that made it to the last possible recursion step with the number equal to the number of items in the list.

Here's the recursive query:


SET STATISTICS TIME ON
SET STATISTICS IO ON
DECLARE @list TABLE
        (
        id INT PRIMARY KEY
        )
;WITH   sets AS
        (
        SELECT  1 AS cnt
        UNION ALL
        SELECT  cnt + 1
        FROM    sets
        WHERE   cnt < 50
        )
INSERT
INTO    @list
SELECT  cnt
FROM    sets

DECLARE @lcnt INT
SELECT  @lcnt = COUNT(*)
FROM    @list

;WITH    l AS
        (
        SELECT  id, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM    @list
        ),
        q AS
        (
        SELECT  l.id AS lid, s.b AS sid, 1 AS rn
        FROM    l
        JOIN    &#91;20090710_compatible&#93;.t_set s
        ON      s.a = l.id
        WHERE   l.rn = 1
        UNION ALL
        SELECT  l.id, s.a, 1
        FROM    l
        JOIN    &#91;20090710_compatible&#93;.t_set s
        ON      s.b = l.id
        WHERE   l.rn = 1
        UNION ALL
        SELECT  l.id, q.sid, q.rn + 1
        FROM    q
        JOIN    l
        ON      l.rn = q.rn + 1
        JOIN    &#91;20090710_compatible&#93;.t_set s
        ON      s.a = CASE WHEN q.sid < l.id THEN q.sid ELSE l.id END
                AND s.b  = CASE WHEN q.sid > l.id THEN q.sid ELSE l.id END
        )
SELECT  *
FROM    q
WHERE   rn = (
        SELECT  COUNT(*)
        FROM    @list
        )
ORDER BY
        sid

lid sid rn
50 566 50
1 row fetched in 0.0002s (0.2081s)
Table 'Worktable'. Scan count 2, logical reads 37279, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_set'. Scan count 6493, logical reads 19487, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table '#76832543'. Scan count 6495, logical reads 12990, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 203 ms,  elapsed time = 204 ms. 

We see that this query is far less efficient than a GROUP BY one.

There are two reasons for this:

  • The recursive query is in fact a NESTED LOOP while the aggregating query uses more efficient HASH JOIN that performs better when you have to join whole resultsets
  • The recursive query cannot be efficiently parallelized

However, the recursive query has one advantage: it performs less steps and filters out non-matching articles quite early, while the aggregating query always aggregates all articles and the filters out the non-matching ones.

Let's change the initial condition a little, and fill the list with all items article 1 is compatible with. This will give us a list of 860 items.

Here's the aggregating query:

SET STATISTICS TIME ON
SET STATISTICS IO ON
DECLARE @list TABLE
        (
        id INT PRIMARY KEY
        )
INSERT
INTO    @list
SELECT  b
FROM    [20090710_compatible].t_set
WHERE   a = 1

DECLARE @lcnt INT
SELECT  @lcnt = COUNT(*)
FROM    @list

SELECT  id, COUNT(*)
FROM    (
        SELECT  a AS id
        FROM    [20090710_compatible].t_set
        WHERE   b IN
                (
                SELECT  id
                FROM    @list
                )
        UNION ALL
        SELECT  b
        FROM    [20090710_compatible].t_set
        WHERE   a IN
                (
                SELECT  id
                FROM    @list
                )
        ) q
GROUP BY
        id
HAVING  COUNT(*) = @lcnt
ORDER BY
        id

id
1 860
1 row fetched in 0.0002s (2.3281s)
Table 't_set'. Scan count 1720, logical reads 9971, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table '#04D1449A'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1078 ms,  elapsed time = 2310 ms. 

We see that it runs for more than 2.3 seconds, since it needs to aggregate way too many values: about 750 items for any given article.

And here is the recursive query:

SET STATISTICS TIME ON
SET STATISTICS IO ON
DECLARE @list TABLE
        (
        id INT PRIMARY KEY
        )
INSERT
INTO    @list
SELECT  b
FROM    [20090710_compatible].t_set
WHERE   a = 1

DECLARE @lcnt INT
SELECT  @lcnt = COUNT(*)
FROM    @list

;WITH    l AS
        (
        SELECT  id, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM    @list
        ),
        q AS
        (
        SELECT  l.id AS lid, s.b AS sid, 1 AS rn
        FROM    l
        JOIN    [20090710_compatible].t_set s
        ON      s.a = l.id
        WHERE   l.rn = 1
        UNION ALL
        SELECT  l.id, s.a, 1
        FROM    l
        JOIN    [20090710_compatible].t_set s
        ON      s.b = l.id
        WHERE   l.rn = 1
        UNION ALL
        SELECT  l.id, q.sid, q.rn + 1
        FROM    q
        JOIN    l
        ON      l.rn = q.rn + 1
        JOIN    [20090710_compatible].t_set s
        ON      s.a = CASE WHEN q.sid < l.id THEN q.sid ELSE l.id END
                AND s.b  = CASE WHEN q.sid > l.id THEN q.sid ELSE l.id END
        )
SELECT  *
FROM    q
WHERE   rn = (
        SELECT  COUNT(*)
        FROM    @list
        )
ORDER BY
        sid
OPTION  (MAXRECURSION 0)

lid sid rn
999 1 860
1 row fetched in 0.0002s (0.4138s)
Table 'Worktable'. Scan count 2, logical reads 43728, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_set'. Scan count 7572, logical reads 22726, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table '#1E91169D'. Scan count 7574, logical reads 15390, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 406 ms,  elapsed time = 408 ms. 

Only 0.4 seconds, much faster.

This is because on each step it deals only with articles that were not yet filtered out.

An article gets filtered out quite quickly, that's why this solutions is better for the big item lists.

Summary:

To find all articles that are compatible to all items from a given list, we need to find out how many items does it satisfy.

There are two ways to do this in SQL Server:

  1. Retrieve all articles compatible with all items, count them and return those articles having COUNT(*) equal to the number of items in the list.
    This colution is more parallelization friendly and can employ more efficient JOIN algorithms, that's why it's better for checking against small item lists.
  2. Retrieve all articles compatible with the first item from the list and recursively check them against all other items in the list. Whenever a check is failed, the recursion is broken. We need to record the recursion step number on each step and return only those articles that made it to the last recursion step, that is the the step with same number as the number of items in the list.
    This solution performs less comparisons and therefore is more efficient for checking against large item lists, where the reduced number of comparison overweights the benefits of parallelization and HASH JOINs.

Written by Quassnoi

July 10th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply