From Stack Overflow:
I have the following query in SQL Server:SELECT TOP 50 CustomerID FROM Ratings WHERE CustomerID != 915 AND MovieID IN ( SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915 ) GROUP BY CustomerID ORDER BY COUNT(*) DESC
It is super fast, but when I try to use it in a subquery like this:SELECT * FROM Ratings WHERE MovieID = 1 AND CustomerID IN ( SELECT TOP 50 CustomerID FROM Ratings WHERE CustomerID != 915 AND MovieID IN ( SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915 ) ) GROUP BY CustomerID ORDER BY COUNT(*) DESC
, it’s getting slow.
Any ideas on why this is so slow and how I can speed it up?
(MovieID, CustomerID)and I added a index on
This query is in fact a simple but efficient movie rating assistant.
When a user searches for a movie, it shows him ratings other users gave to this movie, but only from those people who mostly watches and rates the same movies as the user in question.
This is quite a fair rating system, as if a user is into
Attack of the Killer Tomatoes! and
Surf Nazis Must Die, his taste correlates with taste of people who watches and rates the same stuff better than with one of those who watches
Now, returning to the query. To build an advice on movie 1 for user 915, we need to do the following:
- Build a list of movies that the user 915 rated
- Build the list of users who have watched movie 1 and have rated the same movies as user 915 had
TOP 50users who had rated most movies of those user 915 had also rated
- Show their opinions on movie 1
If we look into this plan, we will see two search conditions. On step 1, we need to select
movie‘s for a given
user, while on step 2 we need to select
user‘s for a given
Since we don’t need any other fields on steps 1 and 2, this query will benefit from creating two composite indexes of the same set of columns:
CustomerID, in different order.
Since we already have a
PRIMARY KEY on
(MovieID, CustomerID), we’ll need to add just one more index on
And since these pairs are unique, we will also make this index
UNIQUE. (My article Making an index UNIQUE describes why you should always declare
UNIQUE index on intrinsically unique data.)
UNIQUE indexes on the same set of columns may seem a redundancy. And in fact, it is.
But let’s remember that any index is a redundancy. An index does not keep any data not initially contained in the table, it just arranges it so that it can be accessed faster.
So if two indexes will make our query even more fast, then hail redundancy!
Let’s create the sample table:
Table creation details
CREATE SCHEMA [20090430_movies] CREATE TABLE Ratings ( CustomerID INT NOT NULL, MovieID INT NOT NULL, Rating INT NOT NULL ) GO ALTER TABLE [20090430_movies].Ratings ADD CONSTRAINT PK_Ratings_Movie_Customer PRIMARY KEY (MovieID, CustomerID) CREATE UNIQUE INDEX UX_Ratings_Customer_Movie ON [20090430_movies].Ratings (CustomerID, MovieID) GO DECLARE @cnt INT; DECLARE @movie INT; DECLARE @left FLOAT; SET @cnt = 1; BEGIN TRANSACTION SELECT RAND(20090430) WHILE @cnt <= 1000 BEGIN SET @left = 1000; SET @movie = 10000; WHILE @movie > 0 BEGIN IF RAND() < (@left / @movie) BEGIN INSERT INTO [20090430_movies].Ratings (MovieID, CustomerID, Rating) VALUES (@movie, @cnt, RAND() * 5 + 1) SET @left = @left - 1 END SET @movie = @movie - 1 END SET @cnt = @cnt + 1 END COMMIT GO
This table contains ratings from 1,000
customer‘s each of whom had randomly rated 1,000
movie‘s. There are, therefore, 1,000,000 records in the table.
We have a
CLUSTERED PRIMARY KEY on
(MovieID, CustomerID) and a
UNIQUE INDEX on
Now, the query:
SELECT r2.* FROM ( SELECT TOP 50 CustomerID FROM ( SELECT MovieID FROM [20090430_movies].Ratings WHERE CustomerID = 915 ) q JOIN [20090430_movies].Ratings r ON r.MovieID = q.MovieID AND CustomerID <> 915 AND EXISTS ( SELECT 1 FROM [20090430_movies].Ratings re WHERE re.MovieID = 1 AND re.CustomerID = r.CustomerID ) GROUP BY CustomerID ORDER BY COUNT(*) DESC ) ro JOIN [20090430_movies].Ratings r2 ON r2.MovieID = 1 AND r2.CustomerID = ro.CustomerID
|50 rows fetched in 0.0017s (0.1843s)|
Table 'Ratings'. Scan count 112, logical reads 879, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 109 ms, elapsed time = 110 ms.
To obtain the resultset aliased as
ro, optimizer uses a
HASH JOIN between the results of two
INDEX SEEK‘s from two indexes on the same columns, but in different orders.
Since all the data are contained in the indexes and
INDEX SEEK‘s can be used to retrieve the data, less than a thousand logical reads are required, and the query completes in only 110 ms.