Selecting friends
If you are building a Yet Another Great Social Network Service to beat MySpace, you'll certainly need to keep a list of friends there, so that Alice may communicate in private with Bob, and they both can show pictures to Chris, and Eve cannot eavesdrop on them and the rest of them can do all these kinds of things these people are supposed to do.
On most networks, friendship is an irreflexive symmetric binary relation:
- Symmetric means that if Alice is a friend of Bob, then Bob is a friend of Alice too.
- Irreflexive means that Alice is never a friend to herself.
As it's a many-to-many relation, we sure need a separate table for it.
But how will we keep it? Should we keep the relation in the table as is (i. e. two separate rows for Alice/Bob and Bob/Alice), or keep just one row and reconstruct the relation using the set operators?
Let's check.
We'll create 50,000 users and randomly assign 20 friends to each of these users. Then we'll try to select list of friends of each of these users in a loop.
In the first table, t_fullfriends
, we will keep two rows for each pair of friends. This is a violation of DKNF and occupies twice as much space, but the queries themselves are more simple.
In the second table, t_semifriends
, we will keep only one row for each pair: least user id
first, greatest user id
second. This is more compact form and it doesn't violate DKNF, but we'll need more complex queries to fetch the relation out of the table.
In a loop, we will try to select a list of friends for each user.
In the first case, we will use:
SELECT ffr_friend FROM t_user JOIN t_fullfriend ON ffr_user = usr_id WHERE usr_id = ?
, which is explained as:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_user | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
1 | SIMPLE | t_fullfriend | ref | PRIMARY | PRIMARY | 4 | const | 41 | 100.00 | Using index |
It's quite straightforward: a NESTED LOOP
using a PRIMARY INDEX
access for both tables.
The second query is a bit more complex:
SELECT IF(usr_id = sfr_least, sfr_greatest, sfr_least) FROM t_user JOIN t_semifriend ON sfr_least = usr_id OR sfr_greatest = usr_id WHERE usr_id = ?
, and explained as
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_user | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
1 | SIMPLE | t_semifriend | index_merge | PRIMARY,ix_semifriend_greatest | PRIMARY,ix_semifriend_greatest | 4,4 | 34 | 100.00 | Using union(PRIMARY,ix_semifriend_greatest); Using where |
Depending on which join condition had fired, we select either sfr_least
or sfr_greatest
as a friend id
. The join condition uses an INDEX_MERGE
access path: it's two ref
's on the corresponding indices, combined together as with UNION
operator.
The following code:
<!--?php $argc < 4 and die ("Usage: getfriends.php query_file host username [password]\n"); $db = new mysqli($argv[2], $argv[3], isset($argv[4]) ? $argv[4] : NULL, "20090307_friends") or die ("Cannot connect: " . mysqli_connect_error() . "\n"); $query = file_get_contents($argv[1]); $stmt = $db--->stmt_init() or die("Cannot initialize statement " . $db->error . "\n"); $stmt->prepare($query) or die ("Cannot prepare statement: " . $stmt->error . "\n"); $s = 0; $res = 0; $stmt->bind_result($s); for ($i = 1; $i <= 50000; $i++) { switch ($stmt->param_count) { case 1: $stmt->bind_param('i', $i); break; case 2: $stmt->bind_param('ii', $i, $i); break; default: die ("Wrong parameter count\n"); } $stmt->execute(); while($stmt->fetch()) $res += $s; } print "$res\n"; ?>
proceduces the following results:
[20090307_friends]$ time php getfriends.php fullfriends.sql hostname username 50000739300 real 0m26.419s user 0m1.571s sys 0m0.976s [20090307_friends]$ time php getfriends.php semifriends.sql hostname username 50000739300 real 0m56.727s user 0m1.767s sys 0m0.972s
That means that the denormalized t_fullfriend
is twice as fast as the normalized t_semifriend
.
But the query plan for t_semifriend
doesn't look so terrible, and the table itself contains half as many records as t_fullfriend
.
Why such a performance degradation?
Two reasons come to mind:
INDEX_MERGE
is less performant than plainUNION ALL
- When joining by
sft_least
, we can take all values right from the index. But when joining onsft_greatest
, we need table access tot_semifriend
to getsft_least
Let's check these out.
If we change the query againt t_semifriends
to:
SELECT sfr_greatest FROM t_user JOIN t_semifriend ON sfr_least = usr_id WHERE usr_id = ? UNION ALL SELECT sfr_least FROM t_user JOIN t_semifriend ON sfr_greatest = usr_id WHERE usr_id = ?
, we get:
[20090307_friends]$ time php getfriends.php semifriends_union.sql 50000739300 real 0m34.233s user 0m1.560s sys 0m0.999s
This is almost as fast as t_fullfriends
. Almost. But still not as fast.
Let's check another assumption:
CREATE UNIQUE INDEX ux_semifriends_greatest_least ON t_semifriend(sfr_greatest, sfr_least)
[20090307_friends]$ time php getfriends.php semifriends.sql hostname username 50000739300 real 0m53.436s user 0m1.676s sys 0m1.032s [20090307_friends]$ time php getfriends.php semifriends_union.sql hostname username 50000739300 real 0m33.697s user 0m1.658s sys 0m1.031s
Nope.
So, from the experiments above we can conclude that:
- For the tables described above,
INDEX_MERGE
is less efficient than a plainUNION ALL
- Keeping two records for each pair instead of one increases performance
The best way to keep the friends relationship in a MySQL database is keeping two separate rows for each pair of friends.
This is both most efficient and most portable. If friendship is not a symmeric relation in your social model (as on LiveJournal), then you don't have to change anything.
Hey, nice article! I’m amused by the consideration of interpersonal relationships as mathematical relations/ operators.
Recording the pair both-ways is a worthwhile conclusion & performance-tip, too :)
Thomas W
22 Sep 13 at 08:35
Thanks for the article! It was very informative. If I could change one thing though, I’d pick slightly more readable table names since the point of this is to inform people, but that’s just me. Keep up the good work.
Jamil
26 Jan 14 at 04:09