Selecting friends

If you are building a Yet Another Great Social Network Service to beat MySpace, you'll certainly need to keep a list of friends there, so that Alice may communicate in private with Bob, and they both can show pictures to Chris, and Eve cannot eavesdrop on them and the rest of them can do all these kinds of things these people are supposed to do.

On most networks, friendship is an irreflexive symmetric binary relation:

Symmetric means that if Alice is a friend of Bob, then Bob is a friend of Alice too.
Irreflexive means that Alice is never a friend to herself.

As it's a many-to-many relation, we sure need a separate table for it.

But how will we keep it? Should we keep the relation in the table as is (i. e. two separate rows for Alice/Bob and Bob/Alice), or keep just one row and reconstruct the relation using the set operators?

Let's check.

We'll create 50,000 users and randomly assign 20 friends to each of these users. Then we'll try to select list of friends of each of these users in a loop.

Table creation details

CREATE TABLE  20090307_friends.t_user (
usr_id int(10) unsigned NOT NULL AUTO_INCREMENT,
usr_name varchar(45) NOT NULL,
PRIMARY KEY (usr_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE  20090307_friends.t_fullfriend (
ffr_user int(10) unsigned NOT NULL,
ffr_friend int(10) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE  20090307_friends.t_semifriend (
sfr_least int(10) unsigned NOT NULL,
sfr_greatest int(10) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

DROP PROCEDURE IF EXISTS 20090307_friends.prc_fill_user $$
CREATE PROCEDURE 20090307_friends.prc_fill_user(cnt INT)
BEGIN
DECLARE i INT;
SET i := 1;
WHILE i &lt;= cnt DO
           INSERT
              INTO    t_user (usr_id, usr_name)
           VALUES  (i, CONCAT('User ', i));
          SET i := i + 1;
     END WHILE;
END $$

DELIMITER ;

START TRANSACTION;
CALL prc_fill_user(50000);
COMMIT;

INSERT
INTO    t_fullfriend (ffr_user, ffr_friend)
SELECT  CASE WHEN dir = 1 THEN usr_id ELSE usr_friend END,
        CASE WHEN dir = 2 THEN usr_id ELSE usr_friend END
FROM    (
        SELECT  1 AS dir
        UNION ALL
        SELECT  2
        ) d
STRAIGHT_JOIN
        (
        SELECT  usr_id,
               (125 * (usr_which - 1) + FLOOR(RAND(20090307) * 125)) * 20 + (usr_id % 20) + 1 AS usr_friend
        FROM    (
                SELECT  usr_id AS usr_which
                FROM    t_user USE INDEX ()
                LIMIT 20
                ) f
        STRAIGHT_JOIN
                t_user u USE INDEX ()
        ) uo;
ALTER TABLE t_fullfriend ADD PRIMARY KEY (ffr_user, ffr_friend);

INSERT
INTO    t_semifriend (sfr_least, sfr_greatest)
SELECT  ffr_user, ffr_friend
FROM    t_fullfriend USE INDEX ()
WHERE   ffr_user &lt; ffr_friend;

ALTER TABLE t_semifriend ADD PRIMARY KEY (sfr_least, sfr_greatest); 
CREATE INDEX ix_semifriend_greatest ON t_semifriend (sfr_greatest);

In the first table, t_fullfriends, we will keep two rows for each pair of friends. This is a violation of DKNF and occupies twice as much space, but the queries themselves are more simple.

In the second table, t_semifriends, we will keep only one row for each pair: least user id first, greatest user id second. This is more compact form and it doesn't violate DKNF, but we'll need more complex queries to fetch the relation out of the table.

In a loop, we will try to select a list of friends for each user.

In the first case, we will use:

SELECT  ffr_friend
FROM    t_user
JOIN    t_fullfriend
ON      ffr_user = usr_id
WHERE   usr_id = ?

, which is explained as:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_user	const	PRIMARY	PRIMARY	4	const	1	100.00	Using index
1	SIMPLE	t_fullfriend	ref	PRIMARY	PRIMARY	4	const	41	100.00	Using index

It's quite straightforward: a NESTED LOOP using a PRIMARY INDEX access for both tables.

The second query is a bit more complex:

SELECT  IF(usr_id = sfr_least, sfr_greatest, sfr_least)
FROM    t_user
JOIN    t_semifriend
ON      sfr_least = usr_id OR sfr_greatest = usr_id
WHERE   usr_id = ?

, and explained as

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_user	const	PRIMARY	PRIMARY	4	const	1	100.00	Using index
1	SIMPLE	t_semifriend	index_merge	PRIMARY,ix_semifriend_greatest	PRIMARY,ix_semifriend_greatest	4,4		34	100.00	Using union(PRIMARY,ix_semifriend_greatest); Using where

Depending on which join condition had fired, we select either sfr_least or sfr_greatest as a friend id. The join condition uses an INDEX_MERGE access path: it's two ref's on the corresponding indices, combined together as with UNION operator.

The following code:

<!--?php
        $argc < 4 and die ("Usage: getfriends.php query_file host username [password]\n");
        $db = new mysqli($argv[2], $argv[3], isset($argv[4]) ? $argv[4] : NULL, "20090307_friends") or die ("Cannot connect: " . mysqli_connect_error() . "\n");
        $query = file_get_contents($argv[1]);
        $stmt = $db--->stmt_init() or die("Cannot initialize statement " . $db-&gt;error . "\n");
$stmt-&gt;prepare($query) or die ("Cannot prepare statement: " . $stmt-&gt;error . "\n");
$s = 0;
$res = 0;
$stmt-&gt;bind_result($s);
for ($i = 1; $i &lt;= 50000; $i++) {
                switch ($stmt-&gt;param_count) {
case 1:
$stmt-&gt;bind_param('i', $i);
break;
case 2:
$stmt-&gt;bind_param('ii', $i, $i);
break;
default:
die ("Wrong parameter count\n");
}
$stmt-&gt;execute();
while($stmt-&gt;fetch())
$res += $s;
}
print "$res\n";
?&gt;

proceduces the following results:

[20090307_friends]$ time php getfriends.php fullfriends.sql hostname username
50000739300

real    0m26.419s
user    0m1.571s
sys     0m0.976s

[20090307_friends]$ time php getfriends.php semifriends.sql hostname username
50000739300

real    0m56.727s
user    0m1.767s
sys     0m0.972s

That means that the denormalized t_fullfriend is twice as fast as the normalized t_semifriend.

But the query plan for t_semifriend doesn't look so terrible, and the table itself contains half as many records as t_fullfriend.

Why such a performance degradation?

Two reasons come to mind:

INDEX_MERGE is less performant than plain UNION ALL
When joining by sft_least, we can take all values right from the index. But when joining on sft_greatest, we need table access to t_semifriend to get sft_least

Let's check these out.

If we change the query againt t_semifriends to:

SELECT  sfr_greatest
FROM    t_user
JOIN    t_semifriend
ON      sfr_least = usr_id
WHERE   usr_id = ?
UNION ALL
SELECT  sfr_least
FROM    t_user
JOIN    t_semifriend
ON      sfr_greatest = usr_id
WHERE   usr_id = ?

, we get:

[20090307_friends]$ time php getfriends.php semifriends_union.sql
50000739300

real    0m34.233s
user    0m1.560s
sys     0m0.999s

This is almost as fast as t_fullfriends. Almost. But still not as fast.

Let's check another assumption:

CREATE UNIQUE INDEX ux_semifriends_greatest_least
ON t_semifriend(sfr_greatest, sfr_least)

[20090307_friends]$ time php getfriends.php semifriends.sql hostname username
50000739300

real    0m53.436s
user    0m1.676s
sys     0m1.032s

[20090307_friends]$ time php getfriends.php semifriends_union.sql hostname username
50000739300

real    0m33.697s
user    0m1.658s
sys     0m1.031s

Nope.

So, from the experiments above we can conclude that:

For the tables described above, INDEX_MERGE is less efficient than a plain UNION ALL
Keeping two records for each pair instead of one increases performance

The best way to keep the friends relationship in a MySQL database is keeping two separate rows for each pair of friends.

This is both most efficient and most portable. If friendship is not a symmeric relation in your social model (as on LiveJournal), then you don't have to change anything.

Written by Quassnoi

March 7th, 2009 at 11:00 pm

Posted in MySQL

2 Responses to 'Selecting friends'

Subscribe to comments with RSS

Hey, nice article! I’m amused by the consideration of interpersonal relationships as mathematical relations/ operators.

Recording the pair both-ways is a worthwhile conclusion & performance-tip, too :)

Thomas W

22 Sep 13 at 08:35
Thanks for the article! It was very informative. If I could change one thing though, I’d pick slightly more readable table names since the point of this is to inform people, but that’s just me. Keep up the good work.

Jamil

26 Jan 14 at 04:09

EXPLAIN EXTENDED