Six degrees of separation
Answering questions asked on the site.
Kathy asks:
I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.
If your site grows popular, most probably, they are not. But we better check.
On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.
The friendship relationship is best stored in a many-to-many link table with a PRIMARY KEY
on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a PRIMARY KEY
won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.
Let's create a sample table:
Table creation details
This table stores records for 1,000,000 people having 20 friends each in average. The first column is named orestes
and the second one pylades
.
With new PostgreSQL 8.4 it is easy to write a recursive query that would traverse the relationship graph up to the given level and stop on the first match.
However, each recursion step requires a join, and as the number of records in the input recordset for the recusion grows with level, the joins become less and less efficient. The number of records grows exponentially, and on level 6 there will be about 20 ^ 6 = 64,000,000 records on input. This is just too much for a join with a 20,000,000 records table.
As the chain length increases, the tree diverges, the number of the records grows and it becomes more and more costly to join them with the table.
To work around this, we should use Bogdan the tunnel builder's algorithm.
British and French governments submit a tender to build a tunnel under the Channel. Many companies apply, all demanding years of time and billions pounds of money, so their offers are refused.
One day, Bogdan drops in and offers his services.
How much money would you demand for your work?, the official asks.Me and my brother Roman are good eaters, so we will need to buy a decent meal every day. 50 pounds a day will be OK.
That's pretty cheap; and how much time will you need?Me and my brother Roman are fast diggers; one mile a day I think we will dig.
Oh, that's pretty fast! But how will you be able to work on such a low budget in such a short time?, the official asks out of curiosity.
That's simple, Bogdan answers,I start digging from the British coast, my brother Roman starts digging from the French coast; the moment we meet, the work is over.
OK, says the official,but what if you don't meet?.
No worries then: you get two tunnels for the price of one
We should do a similar thing here. Instead of traversing the 6 levels from the beginning, we will traverse just 3 levels from each side, then join the resulting recordsets and hope some matching records will be found.
Traversing only 3 levels will be quite fast; and the resulting recordsets will be of moderate size so joining them will be easy.
As an extra, we will return the shortest path from one person to the other. To do this, we will need to record the friendship chain in an array. PostgreSQL does not offer an easy way to reverse an array, so in the first recordset, we will append the friends to the array, while in the second one we will prepend them. This way, we should just concatenate the resulting tunnelsarrays.
Here's the query:
WITH RECURSIVE q1 (person, chain, lvl) AS ( SELECT 123456, ARRAY[123456], 1 UNION ALL SELECT friend, chain || friend, lvl + 1 FROM ( SELECT q1.*, friend FROM q1 JOIN ( SELECT orestes AS me, pylades AS friend FROM friends UNION ALL SELECT pylades AS me, orestes AS friend FROM friends ) f ON person = me WHERE lvl <= 3 ) qo ), q2 (person, chain, lvl) AS ( SELECT 654321, ARRAY[654321], 1 UNION ALL SELECT friend, friend || chain, lvl + 1 FROM ( SELECT q2.*, friend FROM q2 JOIN ( SELECT orestes AS me, pylades AS friend FROM friends UNION ALL SELECT pylades AS me, orestes AS friend FROM friends ) f ON person = me WHERE lvl <= 3 ) qo ) SELECT (q1.chain || q2.chain[2:q2.lvl])::TEXT AS chain FROM q1 JOIN q2 ON q2.person = q1.person ORDER BY q1.lvl + q2.lvl LIMIT 1
Note that the anchor part can not be used more than once in a recursive expression. To work around that, we had to join it to a derived table (a UNION ALL
of two copies of the table with the columns swapped). However, PostgreSQL's optimizer was smart enough to push the join predicate into the derived table and distribute the queries so that each part uses a corresponding index efficiently. This helps to traverse the tree and build the recordsets from both ends.
Each of the recordsets has only about 8,000 records, so scanning and joining them is very fast.
The whole query takes just a little longer than 0.5 seconds.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
The friendship connection may be symmetrical, but the friendship level, from one perspective to the other, may not be. Say people can rate their friendship on a scale of 1) Acquaintence, 2) Casual Friend, 3) Associate & Colleagues, 4) Good Friend, 5) Inner Circle and 6 BFF, BF, GF, Spouse. For example, Sam and Jane are friends. Sam views their friendship at a level 1. Jane would rate it at a 3. If that is the case, would you need a table like this:
f1_id f2_id f_level
1 2 1
2 1 3
How would this work with your query? What are some benefits of having weighted friendship levels with regard to social network analysis?
Thomas
22 Nov 16 at 07:30