How to create fast database queries

Selecting friends

with 2 comments

If you are building a Yet Another Great Social Network Service to beat MySpace, you'll certainly need to keep a list of friends there, so that Alice may communicate in private with Bob, and they both can show pictures to Chris, and Eve cannot eavesdrop on them and the rest of them can do all these kinds of things these people are supposed to do.

On most networks, friendship is an irreflexive symmetric binary relation:

  • Symmetric means that if Alice is a friend of Bob, then Bob is a friend of Alice too.
  • Irreflexive means that Alice is never a friend to herself.

As it's a many-to-many relation, we sure need a separate table for it.

But how will we keep it? Should we keep the relation in the table as is (i. e. two separate rows for Alice/Bob and Bob/Alice), or keep just one row and reconstruct the relation using the set operators?

Let's check.

We'll create 50,000 users and randomly assign 20 friends to each of these users. Then we'll try to select list of friends of each of these users in a loop.

Table creation details

In the first table, t_fullfriends, we will keep two rows for each pair of friends. This is a violation of DKNF and occupies twice as much space, but the queries themselves are more simple.

In the second table, t_semifriends, we will keep only one row for each pair: least user id first, greatest user id second. This is more compact form and it doesn't violate DKNF, but we'll need more complex queries to fetch the relation out of the table.

In a loop, we will try to select a list of friends for each user.

In the first case, we will use:

SELECT  ffr_friend
FROM    t_user
JOIN    t_fullfriend
ON      ffr_user = usr_id
WHERE   usr_id = ?

, which is explained as:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_user const PRIMARY PRIMARY 4 const 1 100.00 Using index
1 SIMPLE t_fullfriend ref PRIMARY PRIMARY 4 const 41 100.00 Using index

It's quite straightforward: a NESTED LOOP using a PRIMARY INDEX access for both tables.

The second query is a bit more complex:

SELECT  IF(usr_id = sfr_least, sfr_greatest, sfr_least)
FROM    t_user
JOIN    t_semifriend
ON      sfr_least = usr_id OR sfr_greatest = usr_id
WHERE   usr_id = ?

, and explained as

id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_user const PRIMARY PRIMARY 4 const 1 100.00 Using index
1 SIMPLE t_semifriend index_merge PRIMARY,ix_semifriend_greatest PRIMARY,ix_semifriend_greatest 4,4 34 100.00 Using union(PRIMARY,ix_semifriend_greatest); Using where

Depending on which join condition had fired, we select either sfr_least or sfr_greatest as a friend id. The join condition uses an INDEX_MERGE access path: it's two ref's on the corresponding indices, combined together as with UNION operator.

The following code:

        $argc < 4 and die ("Usage: getfriends.php query_file host username [password]\n");
        $db = new mysqli($argv[2], $argv[3], isset($argv[4]) ? $argv[4] : NULL, "20090307_friends") or die ("Cannot connect: " . mysqli_connect_error() . "\n");
        $query = file_get_contents($argv[1]);
        $stmt = $db--->stmt_init() or die("Cannot initialize statement " . $db-&gt;error . "\n");
$stmt-&gt;prepare($query) or die ("Cannot prepare statement: " . $stmt-&gt;error . "\n");
$s = 0;
$res = 0;
for ($i = 1; $i &lt;= 50000; $i++) {
                switch ($stmt-&gt;param_count) {
case 1:
$stmt-&gt;bind_param('i', $i);
case 2:
$stmt-&gt;bind_param('ii', $i, $i);
die ("Wrong parameter count\n");
$res += $s;
print "$res\n";

proceduces the following results:

[20090307_friends]$ time php getfriends.php fullfriends.sql hostname username

real    0m26.419s
user    0m1.571s
sys     0m0.976s

[20090307_friends]$ time php getfriends.php semifriends.sql hostname username

real    0m56.727s
user    0m1.767s
sys     0m0.972s

That means that the denormalized t_fullfriend is twice as fast as the normalized t_semifriend.

But the query plan for t_semifriend doesn't look so terrible, and the table itself contains half as many records as t_fullfriend.

Why such a performance degradation?

Two reasons come to mind:

  1. INDEX_MERGE is less performant than plain UNION ALL
  2. When joining by sft_least, we can take all values right from the index. But when joining on sft_greatest, we need table access to t_semifriend to get sft_least

Let's check these out.

If we change the query againt t_semifriends to:

SELECT  sfr_greatest
FROM    t_user
JOIN    t_semifriend
ON      sfr_least = usr_id
WHERE   usr_id = ?
SELECT  sfr_least
FROM    t_user
JOIN    t_semifriend
ON      sfr_greatest = usr_id
WHERE   usr_id = ?

, we get:

[20090307_friends]$ time php getfriends.php semifriends_union.sql

real    0m34.233s
user    0m1.560s
sys     0m0.999s

This is almost as fast as t_fullfriends. Almost. But still not as fast.

Let's check another assumption:

CREATE UNIQUE INDEX ux_semifriends_greatest_least
ON t_semifriend(sfr_greatest, sfr_least)
[20090307_friends]$ time php getfriends.php semifriends.sql hostname username

real    0m53.436s
user    0m1.676s
sys     0m1.032s

[20090307_friends]$ time php getfriends.php semifriends_union.sql hostname username

real    0m33.697s
user    0m1.658s
sys     0m1.031s


So, from the experiments above we can conclude that:

  1. For the tables described above, INDEX_MERGE is less efficient than a plain UNION ALL
  2. Keeping two records for each pair instead of one increases performance

The best way to keep the friends relationship in a MySQL database is keeping two separate rows for each pair of friends.

This is both most efficient and most portable. If friendship is not a symmeric relation in your social model (as on LiveJournal), then you don't have to change anything.

Written by Quassnoi

March 7th, 2009 at 11:00 pm

Posted in MySQL

2 Responses to 'Selecting friends'

Subscribe to comments with RSS

  1. Hey, nice article! I’m amused by the consideration of interpersonal relationships as mathematical relations/ operators.

    Recording the pair both-ways is a worthwhile conclusion & performance-tip, too :)

    Thomas W

    22 Sep 13 at 08:35

  2. Thanks for the article! It was very informative. If I could change one thing though, I’d pick slightly more readable table names since the point of this is to inform people, but that’s just me. Keep up the good work.


    26 Jan 14 at 04:09

Leave a Reply