Latest visitors
Comments enabled. I *really* need your comment
From Stack Overflow:
Say you want to display the latest visitors on a users profile page.
How would you structure this?
Perhaps a table called
uservisitors
:
uservisitors userid visitorid time And how would you select this with MySQL without any duplicates?
What I mean is if user 1 visits user 2's profile, then 5 minutes later visits it again, I don't want to show both entries: only the latest.
There are two approaches to this.
First one would be just aggregating the visits, finding the max time and ordering on it. Something like this:
SELECT visitorid, MAX(time) AS lastvisit FROM uservisitors WHERE userid = 1 GROUP BY userid, visitorid ORDER BY lastvisit DESC LIMIT 5
However, there is a little problem with this solution.
Despite the fact that MySQL (with proper indexing) uses INDEX FOR GROUP-BY
optimization for this query, it will still have to sort on MAX(time)
to find 5 latest records.
This will require sorting the whole resultset which will be huge if the service is heavily loaded.
Let's test it on a sample table:
Table creation details
This table has 1,000,000 records for 2 different users with 100,000 unique visitors. This gives 5 visits to each user's page per visitor.
This reflects real distribution of the data on an actual site quite well.
The query we initially came up with gives the following results:
SELECT visitorid, MAX(time) AS lastvisit FROM uservisitors WHERE userid = 2 GROUP BY userid, visitorid ORDER BY lastvisit DESC LIMIT 5
visitorid | lastvisit | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2009-01-11 23:58:00 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | 2009-01-11 23:56:00 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6 | 2009-01-11 23:54:00 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 | 2009-01-11 23:52:00 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 | 2009-01-11 23:50:00 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0002s (1.5469s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | uservisitors | range | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_visitorid_time | 8 | 125068 | 100.00 | Using where; Using index for group-by; Using temporary; Using filesort |
select `20100112_visitors`.`uservisitors`.`visitorid` AS `visitorid`,max(`20100112_visitors`.`uservisitors`.`time`) AS `lastvisit` from `20100112_visitors`.`uservisitors` where (`20100112_visitors`.`uservisitors`.`userid` = 2) group by `20100112_visitors`.`uservisitors`.`userid`,`20100112_visitors`.`uservisitors`.`visitorid` order by max(`20100112_visitors`.`uservisitors`.`time`) desc limit 5
This query uses INDEX FOR GROUP-BY
, but it needs to make the sorting twice: first to calculate the aggregates, second to satisfy the ORDER BY
clause.
An index scan can be used instead of the first sorting but the second sorting still needs to be done.
That's why there is a filesort
in the plan.
The query works for 1.5 seconds which is way too much for a busy service.
But since we only need the latest 5 visits to each user, we can just iterate the records using an index on time
and filter out the duplicates until we return 5 records.
This can be done by appending a simple condition into the WHERE
clause that would look back in time and find if there already were records with the same values of (userid, visitorid)
.
This is of course a variation of Schlemiel the painter's algorithm.
However, since we only need 5 records and the duplicate visitors are very improbable, this should work alright.
Let's check it:
SELECT visitorid FROM uservisitors uo WHERE userid = 1 AND NOT EXISTS ( SELECT NULL FROM uservisitors ui WHERE ui.userid = uo.userid AND (ui.time, ui.id) > (uo.time, uo.id) AND ui.visitorid = uo.visitorid ) ORDER BY userid DESC, time DESC LIMIT 5
visitorid | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0001s (0.0018s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | uo | ref | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_time | 4 | const | 22940 | 100.00 | Using where |
2 | DEPENDENT SUBQUERY | ui | ref | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_visitorid_time | 8 | 20100112_visitors.uo.userid,20100112_visitors.uo.visitorid | 4 | 100.00 | Using where; Using index |
Field or reference '20100112_visitors.uo.userid' of SELECT #2 was resolved in SELECT #1 Field or reference '20100112_visitors.uo.time' of SELECT #2 was resolved in SELECT #1 Field or reference '20100112_visitors.uo.id' of SELECT #2 was resolved in SELECT #1 Field or reference '20100112_visitors.uo.visitorid' of SELECT #2 was resolved in SELECT #1 select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(exists(select NULL AS `NULL` from `20100112_visitors`.`uservisitors` `ui` where ((`20100112_visitors`.`ui`.`userid` = `20100112_visitors`.`uo`.`userid`) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`)))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5
Here we use NOT EXISTS
clause to check whether the current visitor had a previous visit (and therefore was already returned by the query). This eliminates duplicates.
MySQL
is not very good in optimizing the index access for correlated subqueries, that's why for each visitorid
it checks all records for the given (userid, visitorid)
(despite the fact that we have an additional range condition on time
which could be employed too). That's why we see a USING WHERE
in the plan for the DEPENDENT SUBQUERY
.
However, since every visitor has only 5 visits to any given userid
, this is not much of an issue.
The query completes in 1 ms which is next to instant.
Update of Jan 13th, 2010
As johno correctly pointed out, it is possible to use LEFT JOIN / IS NULL
or NOT IN
instead of NOT EXISTS
to get the same results:
NOT IN
SELECT visitorid FROM uservisitors uo WHERE userid = 1 AND (userid, visitorid) NOT IN ( SELECT userid, visitorid FROM uservisitors ui WHERE (ui.time, ui.id) > (uo.time, uo.id) ) ORDER BY userid DESC, time DESC LIMIT 5
visitorid | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0001s (0.0017s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | uo | ref | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_time | 4 | const | 22940 | 100.00 | Using where |
2 | DEPENDENT SUBQUERY | ui | ref | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_visitorid_time | 8 | func,func | 4 | 100.00 | Using where; Using index |
Field or reference '20100112_visitors.uo.time' of SELECT #2 was resolved in SELECT #1 Field or reference '20100112_visitors.uo.id' of SELECT #2 was resolved in SELECT #1 select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(<in_optimizer>((`20100112_visitors`.`uo`.`userid`,`20100112_visitors`.`uo`.`visitorid`),<exists>(select `20100112_visitors`.`ui`.`userid` AS `userid`,`20100112_visitors`.`ui`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `ui` where (((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (<cache>(`20100112_visitors`.`uo`.`userid`) = `20100112_visitors`.`ui`.`userid`) and (<cache>(`20100112_visitors`.`uo`.`visitorid`) = `20100112_visitors`.`ui`.`visitorid`)) having (<is_not_null_test>(`20100112_visitors`.`ui`.`userid`) and <is_not_null_test>(`20100112_visitors`.`ui`.`visitorid`))))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5
LEFT JOIN / IS NULL
SELECT uo.visitorid FROM uservisitors uo LEFT JOIN uservisitors ui ON ui.userid = uo.userid AND (ui.time, ui.id) > (uo.time, uo.id) AND ui.visitorid = uo.visitorid WHERE uo.userid = 1 AND ui.userid IS NULL ORDER BY uo.userid DESC, uo.time DESC LIMIT 5
visitorid | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 rows fetched in 0.0001s (0.0018s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | uo | ref | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_time | 4 | const | 22940 | 100.00 | Using where |
1 | SIMPLE | ui | ref | ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time | ix_uservisitors_userid_visitorid_time | 8 | const,20100112_visitors.uo.visitorid | 4 | 100.00 | Using where; Using index; Not exists |
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` left join `20100112_visitors`.`uservisitors` `ui` on(((`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`) and (`20100112_visitors`.`ui`.`userid` = 1) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)))) where ((`20100112_visitors`.`uo`.`userid` = 1) and isnull(`20100112_visitors`.`ui`.`userid`)) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5
As I mentioned in one of the previous articles:
, MySQL optimizes all these three predicates to use the same algorithm (return a FALSE
of the first match), but the tests show that LEFT JOIN / IS NULL
and NOT IN
implement this algorithm a little bit more efficiently than NOT EXISTS
.
Though on such a small resultset the difference is negligible, it is always better to use more efficient solutions.