SQL Server: composite indexes improve movie rating query at EXPLAIN EXTENDED

I have the following query in SQL Server:

SELECT  TOP 50 CustomerID FROM Ratings
WHERE   CustomerID != 915
        AND MovieID IN
        (
        SELECT DISTINCT MovieID
        FROM Ratings
        WHERE CustomerID = 915
        )
GROUP BY
        CustomerID
ORDER BY
        COUNT(*) DESC

It is super fast, but when I try to use it in a subquery like this:

SELECT  *
FROM    Ratings
WHERE   MovieID = 1
        AND CustomerID IN
        (
        SELECT  TOP 50 CustomerID FROM Ratings
        WHERE   CustomerID != 915
                AND MovieID IN
                (
                SELECT DISTINCT MovieID
                FROM Ratings
                WHERE CustomerID = 915
                )
        )
GROUP BY
        CustomerID
ORDER BY
        COUNT(*) DESC

, it's getting slow.

Any ideas on why this is so slow and how I can speed it up?

My PRIMARY KEY is (MovieID, CustomerID) and I added a index on CustomerID

This query is in fact a simple but efficient movie rating assistant.

When a user searches for a movie, it shows him ratings other users gave to this movie, but only from those people who mostly watches and rates the same movies as the user in question.

This is quite a fair rating system, as if a user is into Attack of the Killer Tomatoes! and Surf Nazis Must Die, his taste correlates with taste of people who watches and rates the same stuff better than with one of those who watches Titanic

Now, returning to the query. To build an advice on movie 1 for user 915, we need to do the following:

Build a list of movies that the user 915 rated
Build the list of users who have watched movie 1 and have rated the same movies as user 915 had
Select TOP 50 users who had rated most movies of those user 915 had also rated
Show their opinions on movie 1

If we look into this plan, we will see two search conditions. On step 1, we need to select movie's for a given user, while on step 2 we need to select user's for a given movie.

Since we don't need any other fields on steps 1 and 2, this query will benefit from creating two composite indexes of the same set of columns: MovieID and CustomerID, in different order.

Since we already have a PRIMARY KEY on (MovieID, CustomerID), we'll need to add just one more index on (CustomerID, MovieID).

And since these pairs are unique, we will also make this index UNIQUE. (My article Making an index UNIQUE describes why you should always declare UNIQUE index on intrinsically unique data.)

Having two UNIQUE indexes on the same set of columns may seem a redundancy. And in fact, it is.

But let's remember that any index is a redundancy. An index does not keep any data not initially contained in the table, it just arranges it so that it can be accessed faster.

So if two indexes will make our query even more fast, then hail redundancy!

Let's create the sample table:

Table creation details

CREATE SCHEMA [20090430_movies]
CREATE TABLE Ratings (
        CustomerID INT NOT NULL,
        MovieID INT NOT NULL,
        Rating INT NOT NULL
)
GO

ALTER TABLE [20090430_movies].Ratings
ADD CONSTRAINT PK_Ratings_Movie_Customer PRIMARY KEY (MovieID, CustomerID)
CREATE UNIQUE INDEX UX_Ratings_Customer_Movie ON [20090430_movies].Ratings (CustomerID, MovieID)
GO

DECLARE @cnt INT;
DECLARE @movie INT;
DECLARE @left FLOAT;
SET @cnt = 1;
BEGIN TRANSACTION
SELECT  RAND(20090430)
WHILE @cnt <= 1000
BEGIN
        SET @left = 1000;
        SET @movie = 10000;
        WHILE @movie > 0
        BEGIN
                IF RAND() < (@left / @movie)
                BEGIN
                        INSERT
                        INTO    &#91;20090430_movies&#93;.Ratings (MovieID, CustomerID, Rating)
                        VALUES  (@movie, @cnt, RAND() * 5 + 1)
                        SET @left = @left - 1
                END
                SET @movie = @movie - 1
        END
        SET @cnt = @cnt + 1
END
COMMIT
GO
&#91;/sourcecode&#93;
</div>

This table contains ratings from <strong>1,000</strong> <code>customer</code>'s each of whom had randomly rated <strong>1,000</strong> <code>movie</code>'s. There are, therefore, <strong>1,000,000</strong> records in the table.

We have a <code>CLUSTERED PRIMARY KEY</code> on <code>(MovieID, CustomerID)</code> and a <code>UNIQUE INDEX</code> on <code>(CustomerID, MovieID)</code>.

Now, the query:


SELECT  r2.*
FROM    (
        SELECT  TOP 50 CustomerID
        FROM    (
                SELECT  MovieID
                FROM    [20090430_movies].Ratings
                WHERE   CustomerID = 915
                ) q
        JOIN    [20090430_movies].Ratings r
        ON      r.MovieID = q.MovieID
                AND CustomerID <> 915
                AND EXISTS
                (
                SELECT  1
                FROM    [20090430_movies].Ratings re
                WHERE   re.MovieID = 1
                        AND re.CustomerID = r.CustomerID
                )
        GROUP BY
                CustomerID
        ORDER BY
                COUNT(*) DESC
        ) ro
JOIN    [20090430_movies].Ratings r2
ON      r2.MovieID = 1
        AND r2.CustomerID = ro.CustomerID

CustomerID	MovieID	Rating
919	1	5
301	1	5
912	1	4
829	1	5
925	1	1
963	1	1
546	1	3
322	1	2
516	1	3
419	1	4
785	1	1
365	1	5
607	1	2
876	1	5
868	1	3
193	1	4
518	1	4
37	1	4
212	1	2
869	1	1
638	1	3
83	1	1
137	1	3
690	1	3
965	1	4
36	1	3
249	1	3
747	1	3
589	1	1
773	1	2
447	1	4
898	1	1
40	1	1
215	1	1
802	1	4
268	1	3
294	1	4
908	1	2
124	1	2
751	1	2
190	1	2
121	1	4
611	1	2
684	1	5
860	1	2
482	1	3
455	1	4
943	1	1
601	1	5
913	1	4
50 rows fetched in 0.0017s (0.1843s)

Table 'Ratings'. Scan count 112, logical reads 879, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 109 ms,  elapsed time = 110 ms.

To obtain the resultset aliased as ro, optimizer uses a HASH JOIN between the results of two INDEX SEEK's from two indexes on the same columns, but in different orders.

Since all the data are contained in the indexes and INDEX SEEK's can be used to retrieve the data, less than a thousand logical reads are required, and the query completes in only 110 ms.

Written by Quassnoi

April 30th, 2009 at 11:00 pm

Posted in SQL Server

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

EXPLAIN EXTENDED

Rating movies

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Rating movies

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow