Six degrees of separation at EXPLAIN EXTENDED

Six degrees of separation

Answering questions asked on the site.

Kathy asks:

I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.

If your site grows popular, most probably, they are not. But we better check.

On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.

The friendship relationship is best stored in a many-to-many link table with a PRIMARY KEY on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a PRIMARY KEY won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.

Let's create a sample table:

Table creation details

CREATE TABLE friends (
orestes INT NOT NULL,
pylades INT NOT NULL,
CHECK (orestes &lt; pylades)
);

SELECT  SETSEED(0.20100227);

INSERT
INTO    friends
SELECT  o, p
FROM    (
        SELECT  o, SUM(FLOOR(RANDOM() * 100000) + 1) OVER (PARTITION BY o ORDER BY n) AS p
        FROM    (
                SELECT  o, generate_series(1, 20) n
                FROM    generate_series(1, 1000000) o
                ) q
        ) q2
WHERE   o &lt; p
        AND p &lt;= 1000000;

ALTER TABLE friends ADD CONSTRAINT pk_friends_op PRIMARY KEY (orestes, pylades);

CREATE UNIQUE INDEX ux_friends_po ON friends (pylades, orestes);

This table stores records for 1,000,000 people having 20 friends each in average. The first column is named orestes and the second one pylades.

With new PostgreSQL 8.4 it is easy to write a recursive query that would traverse the relationship graph up to the given level and stop on the first match.

However, each recursion step requires a join, and as the number of records in the input recordset for the recusion grows with level, the joins become less and less efficient. The number of records grows exponentially, and on level 6 there will be about 20 ^ 6 = 64,000,000 records on input. This is just too much for a join with a 20,000,000 records table.

As the chain length increases, the tree diverges, the number of the records grows and it becomes more and more costly to join them with the table.

To work around this, we should use Bogdan the tunnel builder's algorithm.

British and French governments submit a tender to build a tunnel under the Channel. Many companies apply, all demanding years of time and billions pounds of money, so their offers are refused.

One day, Bogdan drops in and offers his services.

How much money would you demand for your work?, the official asks. Me and my brother Roman are good eaters, so we will need to buy a decent meal every day. 50 pounds a day will be OK.

That's pretty cheap; and how much time will you need? Me and my brother Roman are fast diggers; one mile a day I think we will dig.

Oh, that's pretty fast! But how will you be able to work on such a low budget in such a short time?, the official asks out of curiosity.

That's simple, Bogdan answers, I start digging from the British coast, my brother Roman starts digging from the French coast; the moment we meet, the work is over.

OK, says the official, but what if you don't meet?.

No worries then: you get two tunnels for the price of one

We should do a similar thing here. Instead of traversing the 6 levels from the beginning, we will traverse just 3 levels from each side, then join the resulting recordsets and hope some matching records will be found.

Traversing only 3 levels will be quite fast; and the resulting recordsets will be of moderate size so joining them will be easy.

As an extra, we will return the shortest path from one person to the other. To do this, we will need to record the friendship chain in an array. PostgreSQL does not offer an easy way to reverse an array, so in the first recordset, we will append the friends to the array, while in the second one we will prepend them. This way, we should just concatenate the resulting ~~tunnels~~arrays.

Here's the query:

WITH    RECURSIVE
q1 (person, chain, lvl) AS
(
SELECT  123456, ARRAY[123456], 1
UNION ALL
SELECT  friend, chain || friend, lvl + 1
FROM    (
SELECT  q1.*,
friend
FROM    q1
JOIN    (
SELECT  orestes AS me, pylades AS friend
FROM    friends
UNION ALL
SELECT  pylades AS me, orestes AS friend
FROM    friends
) f
ON      person = me
WHERE   lvl &lt;= 3
                ) qo
        ),
        q2 (person, chain, lvl) AS
        (
        SELECT  654321, ARRAY[654321], 1
        UNION ALL
        SELECT  friend, friend || chain, lvl + 1
        FROM    (
                SELECT  q2.*,
                        friend
                FROM    q2
                JOIN    (
                        SELECT  orestes AS me, pylades AS friend
                        FROM    friends
                        UNION ALL
                        SELECT  pylades AS me, orestes AS friend
                        FROM    friends
                        ) f
                ON      person = me
                WHERE   lvl &lt;= 3
                ) qo
        )
SELECT  (q1.chain || q2.chain[2:q2.lvl])::TEXT AS chain
FROM    q1
JOIN    q2
ON      q2.person = q1.person
ORDER BY
        q1.lvl + q2.lvl
LIMIT 1

View query details

chain
{123456,890237,278175,654321}
1 row fetched in 0.0003s (0.5313s)

Limit  (cost=1138629880.05..1138629880.05 rows=1 width=72)
  CTE q1
    ->  Recursive Union  (cost=0.00..71529.77 rows=2753901 width=40)
          ->  Result  (cost=0.00..0.01 rows=1 width=0)
          ->  Nested Loop  (cost=0.00..1645.17 rows=275390 width=40)
                Join Filter: (q1.person = "20100227_friends".friends.orestes)
                ->  WorkTable Scan on q1  (cost=0.00..0.22 rows=3 width=40)
                      Filter: (lvl <= 3)
                ->  Append  (cost=0.00..88.96 rows=30 width=8)
                      ->  Index Scan using pk_friends_op on friends  (cost=0.00..37.55 rows=17 width=8)
                            Index Cond: ("20100227_friends".friends.orestes = q1.person)
                      ->  Index Scan using ux_friends_po on friends  (cost=0.00..51.40 rows=13 width=8)
                            Index Cond: ("20100227_friends".friends.pylades = q1.person)
  CTE q2
    ->  Recursive Union  (cost=0.00..71529.77 rows=2753901 width=40)
          ->  Result  (cost=0.00..0.01 rows=1 width=0)
          ->  Nested Loop  (cost=0.00..1645.17 rows=275390 width=40)
                Join Filter: (q2.person = "20100227_friends".friends.orestes)
                ->  WorkTable Scan on q2  (cost=0.00..0.22 rows=3 width=40)
                      Filter: (lvl <= 3)
                ->  Append  (cost=0.00..88.96 rows=30 width=8)
                      ->  Index Scan using pk_friends_op on friends  (cost=0.00..37.55 rows=17 width=8)
                            Index Cond: ("20100227_friends".friends.orestes = q2.person)
                      ->  Index Scan using ux_friends_po on friends  (cost=0.00..51.40 rows=13 width=8)
                            Index Cond: ("20100227_friends".friends.pylades = q2.person)
  ->  Sort  (cost=1138486820.51..1233286454.49 rows=37919853589 width=72)
        Sort Key: ((q1.lvl + q2.lvl))
        ->  Merge Join  (cost=849904.33..948887552.57 rows=37919853589 width=72)
              Merge Cond: (q1.person = q2.person)
              ->  Sort  (cost=424952.16..431836.92 rows=2753901 width=40)
                    Sort Key: q1.person
                    ->  CTE Scan on q1  (cost=0.00..55078.02 rows=2753901 width=40)
              ->  Materialize  (cost=424952.16..459375.93 rows=2753901 width=40)
                    ->  Sort  (cost=424952.16..431836.92 rows=2753901 width=40)
                          Sort Key: q2.person
                          ->  CTE Scan on q2  (cost=0.00..55078.02 rows=2753901 width=40)

Note that the anchor part can not be used more than once in a recursive expression. To work around that, we had to join it to a derived table (a UNION ALL of two copies of the table with the columns swapped). However, PostgreSQL's optimizer was smart enough to push the join predicate into the derived table and distribute the queries so that each part uses a corresponding index efficiently. This helps to traverse the tree and build the recordsets from both ends.

Each of the recordsets has only about 8,000 records, so scanning and joining them is very fast.

The whole query takes just a little longer than 0.5 seconds.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 27th, 2010 at 11:00 pm

Posted in PostgreSQL

One Response to 'Six degrees of separation'

Subscribe to comments with RSS

The friendship connection may be symmetrical, but the friendship level, from one perspective to the other, may not be. Say people can rate their friendship on a scale of 1) Acquaintence, 2) Casual Friend, 3) Associate & Colleagues, 4) Good Friend, 5) Inner Circle and 6 BFF, BF, GF, Spouse. For example, Sam and Jane are friends. Sam views their friendship at a level 1. Jane would rate it at a 3. If that is the case, would you need a table like this:

f1_id f2_id f_level
1 2 1
2 1 3

How would this work with your query? What are some benefits of having weighted friendship levels with regard to social network analysis?

Thomas

22 Nov 16 at 07:30

EXPLAIN EXTENDED