EXPLAIN EXTENDED

How to create fast database queries

Latest visitors

Comments enabled. I *really* need your comment

From Stack Overflow:

Say you want to display the latest visitors on a users profile page.

How would you structure this?

Perhaps a table called uservisitors:

uservisitors
userid visitorid time

And how would you select this with MySQL without any duplicates?

What I mean is if user 1 visits user 2's profile, then 5 minutes later visits it again, I don't want to show both entries: only the latest.

There are two approaches to this.

First one would be just aggregating the visits, finding the max time and ordering on it. Something like this:

SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 1
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5

However, there is a little problem with this solution.

Despite the fact that MySQL (with proper indexing) uses INDEX FOR GROUP-BY optimization for this query, it will still have to sort on MAX(time) to find 5 latest records.

This will require sorting the whole resultset which will be huge if the service is heavily loaded.

Let's test it on a sample table:

Table creation details

This table has 1,000,000 records for 2 different users with 100,000 unique visitors. This gives 5 visits to each user's page per visitor.

This reflects real distribution of the data on an actual site quite well.

The query we initially came up with gives the following results:

SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 2
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5
visitorid lastvisit
2 2009-01-11 23:58:00
4 2009-01-11 23:56:00
6 2009-01-11 23:54:00
8 2009-01-11 23:52:00
10 2009-01-11 23:50:00
5 rows fetched in 0.0002s (1.5469s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE uservisitors range ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_visitorid_time 8 125068 100.00 Using where; Using index for group-by; Using temporary; Using filesort
select `20100112_visitors`.`uservisitors`.`visitorid` AS `visitorid`,max(`20100112_visitors`.`uservisitors`.`time`) AS `lastvisit` from `20100112_visitors`.`uservisitors` where (`20100112_visitors`.`uservisitors`.`userid` = 2) group by `20100112_visitors`.`uservisitors`.`userid`,`20100112_visitors`.`uservisitors`.`visitorid` order by max(`20100112_visitors`.`uservisitors`.`time`) desc limit 5

This query uses INDEX FOR GROUP-BY, but it needs to make the sorting twice: first to calculate the aggregates, second to satisfy the ORDER BY clause.

An index scan can be used instead of the first sorting but the second sorting still needs to be done.

That's why there is a filesort in the plan.

The query works for 1.5 seconds which is way too much for a busy service.

But since we only need the latest 5 visits to each user, we can just iterate the records using an index on time and filter out the duplicates until we return 5 records.

This can be done by appending a simple condition into the WHERE clause that would look back in time and find if there already were records with the same values of (userid, visitorid).

This is of course a variation of Schlemiel the painter's algorithm.

However, since we only need 5 records and the duplicate visitors are very improbable, this should work alright.

Let's check it:

SELECT  visitorid
FROM    uservisitors uo
WHERE   userid = 1
AND NOT EXISTS
(
SELECT  NULL
FROM    uservisitors ui
WHERE   ui.userid = uo.userid
AND (ui.time, ui.id) > (uo.time, uo.id)
AND ui.visitorid = uo.visitorid
)
ORDER BY
userid DESC, time DESC
LIMIT 5

visitorid
1
3
5
7
9
5 rows fetched in 0.0001s (0.0018s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY uo ref ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_time 4 const 22940 100.00 Using where
2 DEPENDENT SUBQUERY ui ref ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_visitorid_time 8 20100112_visitors.uo.userid,20100112_visitors.uo.visitorid 4 100.00 Using where; Using index
Field or reference '20100112_visitors.uo.userid' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.time' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.id' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.visitorid' of SELECT #2 was resolved in SELECT #1
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(exists(select NULL AS `NULL` from `20100112_visitors`.`uservisitors` `ui` where ((`20100112_visitors`.`ui`.`userid` = `20100112_visitors`.`uo`.`userid`) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`)))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5

Here we use NOT EXISTS clause to check whether the current visitor had a previous visit (and therefore was already returned by the query). This eliminates duplicates.

MySQL is not very good in optimizing the index access for correlated subqueries, that's why for each visitorid it checks all records for the given (userid, visitorid) (despite the fact that we have an additional range condition on time which could be employed too). That's why we see a USING WHERE in the plan for the DEPENDENT SUBQUERY.

However, since every visitor has only 5 visits to any given userid, this is not much of an issue.

The query completes in 1 ms which is next to instant.

Update of Jan 13th, 2010

As johno correctly pointed out, it is possible to use LEFT JOIN / IS NULL or NOT IN instead of NOT EXISTS to get the same results:

NOT IN

SELECT  visitorid
FROM    uservisitors uo
WHERE   userid = 1
AND (userid, visitorid) NOT IN
(
SELECT  userid, visitorid
FROM    uservisitors ui
WHERE   (ui.time, ui.id) > (uo.time, uo.id)
)
ORDER BY
userid DESC, time DESC
LIMIT 5

visitorid
1
3
5
7
9
5 rows fetched in 0.0001s (0.0017s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY uo ref ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_time 4 const 22940 100.00 Using where
2 DEPENDENT SUBQUERY ui ref ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_visitorid_time 8 func,func 4 100.00 Using where; Using index
Field or reference '20100112_visitors.uo.time' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.id' of SELECT #2 was resolved in SELECT #1
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(<in_optimizer>((`20100112_visitors`.`uo`.`userid`,`20100112_visitors`.`uo`.`visitorid`),<exists>(select `20100112_visitors`.`ui`.`userid` AS `userid`,`20100112_visitors`.`ui`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `ui` where (((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (<cache>(`20100112_visitors`.`uo`.`userid`) = `20100112_visitors`.`ui`.`userid`) and (<cache>(`20100112_visitors`.`uo`.`visitorid`) = `20100112_visitors`.`ui`.`visitorid`)) having (<is_not_null_test>(`20100112_visitors`.`ui`.`userid`) and <is_not_null_test>(`20100112_visitors`.`ui`.`visitorid`))))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5

LEFT JOIN / IS NULL

SELECT  uo.visitorid
FROM    uservisitors uo
LEFT JOIN
uservisitors ui
ON      ui.userid = uo.userid
AND (ui.time, ui.id) &gt; (uo.time, uo.id)
AND ui.visitorid = uo.visitorid
WHERE   uo.userid = 1
AND ui.userid IS NULL
ORDER BY
uo.userid DESC, uo.time DESC
LIMIT 5

visitorid
1
3
5
7
9
5 rows fetched in 0.0001s (0.0018s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE uo ref ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_time 4 const 22940 100.00 Using where
1 SIMPLE ui ref ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time ix_uservisitors_userid_visitorid_time 8 const,20100112_visitors.uo.visitorid 4 100.00 Using where; Using index; Not exists
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` left join `20100112_visitors`.`uservisitors` `ui` on(((`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`) and (`20100112_visitors`.`ui`.`userid` = 1) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)))) where ((`20100112_visitors`.`uo`.`userid` = 1) and isnull(`20100112_visitors`.`ui`.`userid`)) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5

As I mentioned in one of the previous articles:

, MySQL optimizes all these three predicates to use the same algorithm (return a FALSE of the first match), but the tests show that LEFT JOIN / IS NULL and NOT IN implement this algorithm a little bit more efficiently than NOT EXISTS.

Though on such a small resultset the difference is negligible, it is always better to use more efficient solutions.

Written by Quassnoi

January 12th, 2010 at 11:00 pm

Posted in MySQL

Leave a Reply