Latest visitors at EXPLAIN EXTENDED

Say you want to display the latest visitors on a users profile page.

How would you structure this?

Perhaps a table called uservisitors:

uservisitors

userid visitorid time

And how would you select this with MySQL without any duplicates?

What I mean is if user 1 visits user 2's profile, then 5 minutes later visits it again, I don't want to show both entries: only the latest.

There are two approaches to this.

First one would be just aggregating the visits, finding the max time and ordering on it. Something like this:

SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 1
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5

However, there is a little problem with this solution.

Despite the fact that MySQL (with proper indexing) uses INDEX FOR GROUP-BY optimization for this query, it will still have to sort on MAX(time) to find 5 latest records.

This will require sorting the whole resultset which will be huge if the service is heavily loaded.

Let's test it on a sample table:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE uservisitors (
id INT NOT NULL PRIMARY KEY,
userid INT NOT NULL,
visitorid INT NOT NULL,
time DATETIME NOT NULL,
KEY ix_uservisitors_userid_visitorid_time (userid, visitorid, time),
KEY ix_uservisitors_userid_time (userid, time)
) ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    uservisitors
SELECT  id, (id - 1) % 2 + 1, (id - 1) % 100000 + 1, '2009-01-12' - INTERVAL id MINUTE
FROM    filler;

This table has 1,000,000 records for 2 different users with 100,000 unique visitors. This gives 5 visits to each user's page per visitor.

This reflects real distribution of the data on an actual site quite well.

The query we initially came up with gives the following results:

SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 2
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5

visitorid	lastvisit
2	2009-01-11 23:58:00
4	2009-01-11 23:56:00
6	2009-01-11 23:54:00
8	2009-01-11 23:52:00
10	2009-01-11 23:50:00
5 rows fetched in 0.0002s (1.5469s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	uservisitors	range	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_visitorid_time	8		125068	100.00	Using where; Using index for group-by; Using temporary; Using filesort

select `20100112_visitors`.`uservisitors`.`visitorid` AS `visitorid`,max(`20100112_visitors`.`uservisitors`.`time`) AS `lastvisit` from `20100112_visitors`.`uservisitors` where (`20100112_visitors`.`uservisitors`.`userid` = 2) group by `20100112_visitors`.`uservisitors`.`userid`,`20100112_visitors`.`uservisitors`.`visitorid` order by max(`20100112_visitors`.`uservisitors`.`time`) desc limit 5

This query uses INDEX FOR GROUP-BY, but it needs to make the sorting twice: first to calculate the aggregates, second to satisfy the ORDER BY clause.

An index scan can be used instead of the first sorting but the second sorting still needs to be done.

That's why there is a filesort in the plan.

The query works for 1.5 seconds which is way too much for a busy service.

But since we only need the latest 5 visits to each user, we can just iterate the records using an index on time and filter out the duplicates until we return 5 records.

This can be done by appending a simple condition into the WHERE clause that would look back in time and find if there already were records with the same values of (userid, visitorid).

This is of course a variation of Schlemiel the painter's algorithm.

However, since we only need 5 records and the duplicate visitors are very improbable, this should work alright.

Let's check it:

SELECT  visitorid
FROM    uservisitors uo
WHERE   userid = 1
AND NOT EXISTS
(
SELECT  NULL
FROM    uservisitors ui
WHERE   ui.userid = uo.userid
AND (ui.time, ui.id) &gt; (uo.time, uo.id)
AND ui.visitorid = uo.visitorid
)
ORDER BY
userid DESC, time DESC
LIMIT 5

visitorid
1
3
5
7
9
5 rows fetched in 0.0001s (0.0018s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	uo	ref	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_time	4	const	22940	100.00	Using where
2	DEPENDENT SUBQUERY	ui	ref	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_visitorid_time	8	20100112_visitors.uo.userid,20100112_visitors.uo.visitorid	4	100.00	Using where; Using index

Field or reference '20100112_visitors.uo.userid' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.time' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.id' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.visitorid' of SELECT #2 was resolved in SELECT #1
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(exists(select NULL AS `NULL` from `20100112_visitors`.`uservisitors` `ui` where ((`20100112_visitors`.`ui`.`userid` = `20100112_visitors`.`uo`.`userid`) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`)))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5

Here we use NOT EXISTS clause to check whether the current visitor had a previous visit (and therefore was already returned by the query). This eliminates duplicates.

MySQL is not very good in optimizing the index access for correlated subqueries, that's why for each visitorid it checks all records for the given (userid, visitorid) (despite the fact that we have an additional range condition on time which could be employed too). That's why we see a USING WHERE in the plan for the DEPENDENT SUBQUERY.

However, since every visitor has only 5 visits to any given userid, this is not much of an issue.

The query completes in 1 ms which is next to instant.

Update of Jan 13th, 2010

As johno correctly pointed out, it is possible to use LEFT JOIN / IS NULL or NOT IN instead of NOT EXISTS to get the same results:

NOT IN

SELECT  visitorid
FROM    uservisitors uo
WHERE   userid = 1
AND (userid, visitorid) NOT IN
(
SELECT  userid, visitorid
FROM    uservisitors ui
WHERE   (ui.time, ui.id) &gt; (uo.time, uo.id)
)
ORDER BY
userid DESC, time DESC
LIMIT 5

visitorid
1
3
5
7
9
5 rows fetched in 0.0001s (0.0017s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	uo	ref	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_time	4	const	22940	100.00	Using where
2	DEPENDENT SUBQUERY	ui	ref	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_visitorid_time	8	func,func	4	100.00	Using where; Using index

Field or reference '20100112_visitors.uo.time' of SELECT #2 was resolved in SELECT #1
Field or reference '20100112_visitors.uo.id' of SELECT #2 was resolved in SELECT #1
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(<in_optimizer>((`20100112_visitors`.`uo`.`userid`,`20100112_visitors`.`uo`.`visitorid`),<exists>(select `20100112_visitors`.`ui`.`userid` AS `userid`,`20100112_visitors`.`ui`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `ui` where (((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (<cache>(`20100112_visitors`.`uo`.`userid`) = `20100112_visitors`.`ui`.`userid`) and (<cache>(`20100112_visitors`.`uo`.`visitorid`) = `20100112_visitors`.`ui`.`visitorid`)) having (<is_not_null_test>(`20100112_visitors`.`ui`.`userid`) and <is_not_null_test>(`20100112_visitors`.`ui`.`visitorid`))))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5

LEFT JOIN / IS NULL

SELECT  uo.visitorid
FROM    uservisitors uo
LEFT JOIN
uservisitors ui
ON      ui.userid = uo.userid
AND (ui.time, ui.id) &gt; (uo.time, uo.id)
AND ui.visitorid = uo.visitorid
WHERE   uo.userid = 1
AND ui.userid IS NULL
ORDER BY
uo.userid DESC, uo.time DESC
LIMIT 5

visitorid
1
3
5
7
9
5 rows fetched in 0.0001s (0.0018s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	uo	ref	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_time	4	const	22940	100.00	Using where
1	SIMPLE	ui	ref	ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time	ix_uservisitors_userid_visitorid_time	8	const,20100112_visitors.uo.visitorid	4	100.00	Using where; Using index; Not exists

select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` left join `20100112_visitors`.`uservisitors` `ui` on(((`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`) and (`20100112_visitors`.`ui`.`userid` = 1) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) > (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)))) where ((`20100112_visitors`.`uo`.`userid` = 1) and isnull(`20100112_visitors`.`ui`.`userid`)) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5

As I mentioned in one of the previous articles:

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

, MySQL optimizes all these three predicates to use the same algorithm (return a FALSE of the first match), but the tests show that LEFT JOIN / IS NULL and NOT IN implement this algorithm a little bit more efficiently than NOT EXISTS.

Though on such a small resultset the difference is negligible, it is always better to use more efficient solutions.

Written by Quassnoi

January 12th, 2010 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED

Latest visitors

Update of Jan 13th, 2010

NOT IN

LEFT JOIN / IS NULL

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Latest visitors

Update of Jan 13th, 2010

NOT IN

LEFT JOIN / IS NULL

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow