Archive for April, 2009
Rating movies
Comments enabled. I *really* need your comment
From Stack Overflow:
I have the following query in SQL Server:
SELECT TOP 50 CustomerID
FROM Ratings
WHERE CustomerID != 915
    AND MovieID IN (SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915)
GROUP BY CustomerID
ORDER BY COUNT(*) DESC

It is super fast, but when I try to use it in a subquery like this:

SELECT *
FROM Ratings
WHERE MovieID = 1
    AND CustomerID IN (
        SELECT TOP 50 CustomerID
        FROM Ratings
        WHERE CustomerID != 915
            AND MovieID IN (SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915)
    )
GROUP BY CustomerID
ORDER BY COUNT(*) DESC

it's getting slow.
Any ideas on why this is so slow and how I can speed it up?
My PRIMARY KEY is (MovieID, CustomerID), and I added an index on CustomerID.
This query is in fact a simple but efficient movie rating assistant.
When a user searches for a movie, it shows him the ratings other users gave to this movie, but only from those people who mostly watch and rate the same movies as the user in question.
This is quite a fair rating system: if a user is into Attack of the Killer Tomatoes! and Surf Nazis Must Die, his taste correlates better with the taste of people who watch and rate the same stuff than with that of those who watch Titanic.
Now, returning to the query. To build advice on movie 1 for user 915, we need to do the following:
- Build a list of movies that user 915 rated
- Build a list of users who have watched movie 1 and have rated the same movies as user 915
- Select the TOP 50 users who have rated the most movies of those user 915 has also rated
- Show their opinions on movie 1
If we look into this plan, we will see two search conditions: on the first step, we need to select movies for a given user, while on the second we need to select users for a given movie.
Since we don't need any other fields on these two steps, this query will benefit from creating two composite indexes on the same set of columns, MovieID and CustomerID, in different order.
Since we already have a PRIMARY KEY on (MovieID, CustomerID), we'll need to add just one more index, on (CustomerID, MovieID).
And since these pairs are unique, we will also make this index UNIQUE. (My article Making an index UNIQUE describes why you should always declare a UNIQUE index on intrinsically unique data.)
Having two UNIQUE indexes on the same set of columns may seem a redundancy. And in fact, it is.
But let's remember that any index is a redundancy. An index does not keep any data not initially contained in the table; it just arranges the data so that it can be accessed faster.
So if two indexes make our query even faster, then hail redundancy!
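To see the two-index idea in action outside SQL Server, here is a minimal sketch in Python with SQLite (column names follow the question; the ratings themselves are made up, and SQLite's LIMIT stands in for TOP). The PRIMARY KEY serves the users-for-a-movie search; the mirror index serves the movies-for-a-user search:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Ratings (
        MovieID    INTEGER NOT NULL,
        CustomerID INTEGER NOT NULL,
        Rating     INTEGER NOT NULL,
        PRIMARY KEY (MovieID, CustomerID)   -- serves "users for a given movie"
    );
    -- The mirror image: same columns, opposite order,
    -- serves "movies for a given user".
    CREATE UNIQUE INDEX ux_ratings_customer_movie
        ON Ratings (CustomerID, MovieID);
""")
con.executemany("INSERT INTO Ratings VALUES (?, ?, ?)", [
    (1, 915, 5), (2, 915, 4),   # user 915 rated movies 1 and 2
    (1, 100, 3), (2, 100, 5),   # user 100 rated the same two movies
    (1, 200, 2), (3, 200, 4),   # user 200 shares only movie 1 with 915
])

# Users other than 915 who rated movies that user 915 rated, busiest first
rows = con.execute("""
    SELECT CustomerID FROM Ratings
    WHERE CustomerID != 915
      AND MovieID IN (SELECT MovieID FROM Ratings WHERE CustomerID = 915)
    GROUP BY CustomerID
    ORDER BY COUNT(*) DESC
    LIMIT 50
""").fetchall()
```

With this toy data, user 100 shares two rated movies with user 915 and user 200 shares one, so 100 comes first.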
Let's create the sample table:
Read the rest of this entry »
Selecting N records for each group: PostgreSQL
Comments enabled. I *really* need your comment
In one of the previous articles, I described how to select a certain number of records for each group in a MySQL table.
This is trivial in SQL Server and Oracle, since the analytic function ROW_NUMBER() can be used to do this.
Now, I'll describe how to do it in PostgreSQL.
We are assuming that the records from the table should be grouped by a field called grouper, and for each grouper, the first N records should be selected, ordered first by ts, then by id.
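For comparison, on an engine that does support window functions the task is a one-liner; the article works around the lack of them in the PostgreSQL of that time. A sketch in Python with SQLite (which also has ROW_NUMBER(); grouper, ts and id are the names assumed above, the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grouper TEXT, ts INTEGER)")
con.executemany("INSERT INTO t (grouper, ts) VALUES (?, ?)",
                [("a", 3), ("a", 1), ("a", 2), ("b", 2), ("b", 1)])

N = 2  # first N records per grouper
rows = con.execute("""
    SELECT grouper, ts, id FROM (
        SELECT grouper, ts, id,
               ROW_NUMBER() OVER (
                   PARTITION BY grouper
                   ORDER BY ts, id) AS rn   -- 1, 2, 3, ... within each group
        FROM t
    )
    WHERE rn <= ?
    ORDER BY grouper, ts, id
""", (N,)).fetchall()
```

Each group keeps only its first N rows in (ts, id) order; the row numbered 3 in group a is dropped.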
Let's create a sample table:
Read the rest of this entry »
Descending indexes
From Stack Overflow:
When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending.
I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way?
What difference does it make which order I choose?
This matters when composite indexes are used.
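A quick way to convince yourself: SQLite also accepts per-column directions in composite indexes, and its query plans show when an extra sort step is needed. In this sketch (table and index names are mine), an ORDER BY that matches the mixed-direction index needs no sort, while a mix the index cannot serve forces a temp B-tree:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
con.execute("CREATE INDEX ix_a_asc_b_desc ON t (a ASC, b DESC)")

def plan(sql):
    # Last column of EXPLAIN QUERY PLAN output is the human-readable detail
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# Matches the index's column directions exactly: scanning the index
# already yields rows in the requested order.
matched = plan("SELECT a, b FROM t ORDER BY a ASC, b DESC")

# Same leading direction but the second column flipped: the index order
# no longer matches, so the engine must sort (a temp B-tree in SQLite).
unmatched = plan("SELECT a, b FROM t ORDER BY a ASC, b ASC")
```

Note that a fully reversed ORDER BY (a DESC, b ASC) would also avoid the sort, since the index can simply be scanned backwards; it is the mixed case that only a matching index definition can serve.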
Let's create a sample table:
Read the rest of this entry »
Keeping latest rows for a group
From Stack Overflow:
Assume I have a table foo where I have something like this:

id, user_id, timestamp, some_value

What I want to do is remove all rows that aren't the newest N per user. The deletion itself could be handled by a query like this:

DELETE FROM foo WHERE id NOT IN (...)

so you could rephrase the problem into this: how do I get the newest N (there might be fewer) rows for each user? This means if I have U users I may end up with N×U rows, so LIMIT won't really work.
Unfortunately, NOT IN will not work here. MySQL doesn't allow using the target table in an IN or NOT IN clause in a DELETE or UPDATE statement.
But some time ago I posted two solutions:
- Keeping rows — how to DELETE all rows except N
- Advanced row sampling — how to select TOP N rows for each GROUP
Now it's time to use these solutions together.
What we need to do here is:
- Select DISTINCT user_id's from the table
- In a subquery, SELECT the n'th timestamp and the n'th id for each user_id. We need to select both these fields to handle possible duplicates of timestamp's. Unfortunately, MySQL doesn't allow multiple columns in a subquery used in a SELECT clause; that's why we need two subselects (ugly, I know)
- Join the resultset with the original table, for each user_id selecting all rows with (timestamp, id) less than the (n'th timestamp, n'th id) selected above
- DELETE from the joined table. Unlike the IN and NOT IN conditions, using the target table in a JOIN is OK for MySQL
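The keep-latest-N logic itself can be sketched end to end in Python with SQLite. Note this is not the MySQL query from below: SQLite, unlike MySQL, does not mind the target table appearing in a DELETE subquery, so a window-function formulation is enough there (table layout follows the question; data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE foo (
    id INTEGER PRIMARY KEY, user_id INTEGER, timestamp INTEGER, some_value TEXT)""")
# 10 rows per user for 2 users, timestamps 0..9
con.executemany(
    "INSERT INTO foo (user_id, timestamp, some_value) VALUES (?, ?, ?)",
    [(u, t, f"Value {u}-{t}") for u in (1, 2) for t in range(10)])

N = 3  # keep the newest N rows per user
con.execute("""
    DELETE FROM foo WHERE id NOT IN (
        SELECT id FROM (
            SELECT id, ROW_NUMBER() OVER (
                PARTITION BY user_id
                ORDER BY timestamp DESC, id DESC) AS rn   -- id breaks ts ties
            FROM foo
        ) WHERE rn <= ?
    )
""", (N,))

kept = con.execute(
    "SELECT user_id, timestamp FROM foo ORDER BY user_id, timestamp").fetchall()
```

Only the three newest timestamps per user survive; the (timestamp, id) tie-break mirrors the article's tuple comparison.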
Now, let's create the sample table and see how it works:
This table has 10,000 records for 5 users.
Now, let's try to delete all records except the latest 3 for each user_id:
DELETE l.*
FROM foo l
JOIN (
    SELECT user_id,
        COALESCE(
            (
            SELECT timestamp
            FROM foo li
            WHERE li.user_id = dlo.user_id
            ORDER BY li.user_id DESC, li.timestamp DESC
            LIMIT 2, 1
            ), CAST('0001-01-01' AS DATETIME)
        ) AS mts,
        COALESCE(
            (
            SELECT id
            FROM foo li
            WHERE li.user_id = dlo.user_id
            ORDER BY li.user_id DESC, li.timestamp DESC, li.id DESC
            LIMIT 2, 1
            ), -1
        ) AS mid
    FROM (
        SELECT DISTINCT user_id
        FROM foo dl
    ) dlo
) lo
ON l.user_id = lo.user_id
    AND (l.timestamp, l.id) < (mts, mid)
We assume here that there are no records dated earlier than Jan 1, 1 AD, and no negative id's.
Let's see what's left after the query finishes:
SELECT * FROM foo ORDER BY user_id, timestamp, id
id   | user_id | timestamp           | some_value
-----|---------|---------------------|-----------
5950 | 1       | 2009-04-25 23:56:41 | Value 5950
935  | 1       | 2009-04-25 23:58:52 | Value 935
95   | 1       | 2009-04-25 23:59:07 | Value 95
8461 | 2       | 2009-04-25 23:57:18 | Value 8461
4631 | 2       | 2009-04-25 23:58:14 | Value 4631
6626 | 2       | 2009-04-25 23:59:10 | Value 6626
7487 | 3       | 2009-04-25 23:55:41 | Value 7487
147  | 3       | 2009-04-25 23:56:08 | Value 147
27   | 3       | 2009-04-25 23:59:49 | Value 27
2698 | 4       | 2009-04-25 23:59:17 | Value 2698
1868 | 4       | 2009-04-25 23:59:40 | Value 1868
7938 | 4       | 2009-04-25 23:59:56 | Value 7938
8929 | 5       | 2009-04-25 23:57:59 | Value 8929
1934 | 5       | 2009-04-25 23:58:19 | Value 1934
6179 | 5       | 2009-04-25 23:58:22 | Value 6179

15 rows fetched in 0.0005s (0.0020s)
Just like requested: 15 records left, 3 latest records for each user_id
.
Selecting timestamps for a time zone
Comments enabled. I *really* need your comment
From Stack Overflow:
How can I reuse a computed column in an SQL query in MySQL?
My query is something like this:
SELECT CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
WHERE CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) < CURRENT_TIMESTAMP()

As you can see, the CONVERT_TZ(…) part is repeated. This is only a sample query. Actually, I have to use that computed column several times, so if I make a change in one place, I have to change it in many places. And the size of the query becomes scary, too.
Is there any way to avoid such duplication?
The duplication part is simple: we just need to wrap the query in a subquery, or use the HAVING clause, like this:
SELECT *
FROM (
    SELECT CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update),
        'GMT',
        user.timezone
    ) AS time_usr_tz
    FROM schedule
    JOIN user
    ON schedule.user_id = user.user_id
) q
WHERE time_usr_tz < CURRENT_TIMESTAMP()
or this:

SELECT CONVERT_TZ(
    IF(timestamp_start > last_update, timestamp_start, last_update),
    'GMT',
    user.timezone
) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
HAVING time_usr_tz < CURRENT_TIMESTAMP()
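SQLite has no CONVERT_TZ(), so as a stand-in this sketch reuses a simpler computed column: the IF(a > b, a, b) part expressed as SQLite's two-argument scalar MAX(). The subquery-wrapping trick is the same; the table contents are made up, and the user table is omitted for brevity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE schedule (user_id INTEGER, timestamp_start INTEGER, last_update INTEGER);
    INSERT INTO schedule VALUES (1, 10, 20), (1, 30, 5), (2, 50, 60);
""")

now = 25
rows = con.execute("""
    SELECT * FROM (
        SELECT user_id,
               MAX(timestamp_start, last_update) AS effective_ts  -- computed once
        FROM schedule
    ) q
    WHERE effective_ts < ?                                        -- ...reused by alias
""", (now,)).fetchall()
```

Only the first schedule (effective timestamp 20, before "now" = 25) passes the filter, and the expression appears exactly once in the query text.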
But what I'd really like to write about today is the performance of such a query.
This query looks for the list of schedules and finds those that were to be started by the time the query is issued.
The problem is that the schedule timestamps are not bound to any time zone, and should be converted to the user's time zone to find out whether they are due or not.
That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at 9:00 and reminded to take a pill at 13:00, but if she goes from Washington, D.C. to Honolulu, she may want all her alarm clocks to change from EST to HST. Instead of updating all the timestamps it's of course easier to update a single field in her profile.
The query uses columns from both tables inside a function in the WHERE condition; that's why no indexes can be used to filter the timestamps.
If schedule is large enough, this can be a problem for a busy service.
Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only the timestamp_start column in schedule, which is enough to illustrate the point.
Read the rest of this entry »
Selecting lowest value
Comments enabled. I *really* need your comment
From Stack Overflow:
Say I have the following data:

Name   | Value
-------|------
Small  | 10
Medium | 100
Large  | 1000

Imagine that these represent the volumes of boxes.
I have some items that I want to put in the boxes, and I want the smallest box possible.
I need an SQL query that will:
- Return the row with the smallest value greater than my query parameter
- If there is no such row, then return the row with the largest value
I'm on Oracle 11g, so any special Oracle goodness is OK.
There certainly is a special Oracle goodness, and I'll cover it in this article.
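Before the Oracle goodness, the portable way to express "smallest value that fits, else the largest" is a single ORDER BY … LIMIT 1. A sketch in Python with SQLite (table name is mine, and I read "greater than" as >= here, since a box exactly the item's volume still fits):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE boxes (name TEXT, value INTEGER)")
con.executemany("INSERT INTO boxes VALUES (?, ?)",
                [("Small", 10), ("Medium", 100), ("Large", 1000)])

def best_box(volume):
    # Rows that fit come first (the boolean flag, descending);
    # among those the smallest wins, among the too-small leftovers
    # the largest wins (hence the negated value).
    return con.execute("""
        SELECT name, value FROM boxes
        ORDER BY (value >= ?) DESC,
                 CASE WHEN value >= ? THEN value ELSE -value END
        LIMIT 1
    """, (volume, volume)).fetchone()
```

So best_box(50) picks Medium, while an oversized item falls back to the largest box.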
Let's create the sample table and fill it with 1,000,000 rows to illustrate the point:
Read the rest of this entry »
Making an index UNIQUE
From Stack Overflow:
In SQL Server, if I know an index will have unique values, how will it affect performance on INSERTs or SELECTs if I declare it as such?
If the optimiser knows the index is UNIQUE, how will that affect the query plan?
I understand that specifying uniqueness can serve to preserve integrity, but leaving that discussion aside for the moment, what are the performance consequences?
Let's see how a UNIQUE index performs compared to a non-UNIQUE one.
The indexes mentioned hereafter are considered to be built on the same, intrinsically unique dataset. I will leave consistency issues aside, focusing only on performance details.
A UNIQUE index, contrary to popular opinion, does not have any performance drawbacks on DML compared to a plain, non-UNIQUE index.
When the engine tries to insert a new row into the indexed table, whether the index is UNIQUE or not, it performs the following steps:
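The integrity side of the question, at least, is easy to demonstrate on any engine. A sketch in Python with SQLite (table and index names are mine): a declared UNIQUE index rejects a duplicate key at insert time, while the performance discussion follows in the article:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
con.execute("CREATE UNIQUE INDEX ux_t_a ON t (a)")  # declared UNIQUE

con.execute("INSERT INTO t VALUES (1, 1)")
try:
    con.execute("INSERT INTO t VALUES (1, 2)")      # same key a = 1: rejected
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The lookup needed to detect the duplicate is the same index traversal the insert has to perform anyway to find where the new entry goes, which is the core of the no-drawback argument.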
Read the rest of this entry »
Counting missing rows: PostgreSQL
Comments enabled. I *really* need your comment
This is the 5th of 5 articles covering the implementation of the NOT IN predicate in several RDBMSes:
- NOT IN in MySQL
- NOT IN in MySQL (using DISTINCT)
- NOT IN in SQL Server
- NOT IN in Oracle
- NOT IN in PostgreSQL
Finally, let's look at how PostgreSQL copes with this predicate.
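Stripped of any engine specifics, the predicate in question counts the rows of one table that have no match in another. A sketch in Python with SQLite (t_left and t_right are my names, not the article's sample tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t_left  (value INTEGER);
    CREATE TABLE t_right (value INTEGER);
    INSERT INTO t_left  VALUES (1), (2), (3), (4);
    INSERT INTO t_right VALUES (1), (3);
""")

# Count rows of t_left missing from t_right.
# Caveat: a NULL in t_right's column would make NOT IN match nothing,
# which is the classic NOT IN trap every engine shares.
missing = con.execute("""
    SELECT COUNT(*) FROM t_left
    WHERE value NOT IN (SELECT value FROM t_right)
""").fetchone()[0]
```

Here values 2 and 4 are missing, so the count is 2; how each RDBMS plans this query is what the series compares.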
Let's create sample tables:
Read the rest of this entry »
Counting missing rows: Oracle
Comments enabled. I *really* need your comment
This is the 4th of 5 articles covering the implementation of the NOT IN predicate in several RDBMSes:
- NOT IN in MySQL
- NOT IN in MySQL (using DISTINCT)
- NOT IN in SQL Server
- NOT IN in Oracle
- NOT IN in PostgreSQL
Let's see how Oracle processes this predicate.
Let's create the sample tables:
Read the rest of this entry »
Counting missing rows: SQL Server
Comments enabled. I *really* need your comment
This is the 3rd of 5 articles covering the implementation of the NOT IN predicate in several RDBMSes:
- NOT IN in MySQL
- NOT IN in MySQL (using DISTINCT)
- NOT IN in SQL Server
- NOT IN in Oracle
- NOT IN in PostgreSQL
Now, let's see how the same predicate is implemented in SQL Server.
We'll create sample tables just like those we used yesterday:
Read the rest of this entry »