Answering questions asked on the site.
I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.
If your site grows popular, most probably, they are not. But we better check.
On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.
The friendship relationship is best stored in a many-to-many link table with a
PRIMARY KEY on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a
PRIMARY KEY won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.
Let's create a sample table:
Table creation details
CREATE TABLE friends ( orestes INT NOT NULL, pylades INT NOT NULL, CHECK (orestes < pylades) ); SELECT SETSEED(0.20100227); INSERT INTO friends SELECT o, p FROM ( SELECT o, SUM(FLOOR(RANDOM() * 100000) + 1) OVER (PARTITION BY o ORDER BY n) AS p FROM ( SELECT o, generate_series(1, 20) n FROM generate_series(1, 1000000) o ) q ) q2 WHERE o < p AND p <= 1000000; ALTER TABLE friends ADD CONSTRAINT pk_friends_op PRIMARY KEY (orestes, pylades); CREATE UNIQUE INDEX ux_friends_po ON friends (pylades, orestes);
This table stores records for 1,000,000 people having 20 friends each in average. The first column is named
orestes and the second one
With new PostgreSQL 8.4 it is easy to write a recursive query that would traverse the relationship graph up to the given level and stop on the first match.
However, each recursion step requires a join, and as the number of records in the input recordset for the recusion grows with level, the joins become less and less efficient. The number of records grows exponentially, and on level 6 there will be about 20 ^ 6 = 64,000,000 records on input. This is just too much for a join with a 20,000,000 records table.
As the chain length increases, the tree diverges, the number of the records grows and it becomes more and more costly to join them with the table.
To work around this, we should use Bogdan the tunnel builder's algorithm.
British and French governments submit a tender to build a tunnel under the Channel. Many companies apply, all demanding years of time and billions pounds of money, so their offers are refused.
One day, Bogdan drops in and offers his services.
How much money would you demand for your work?, the official asks.Me and my brother Roman are good eaters, so we will need to buy a decent meal every day. 50 pounds a day will be OK.
That's pretty cheap; and how much time will you need?Me and my brother Roman are fast diggers; one mile a day I think we will dig.
Oh, that's pretty fast! But how will you be able to work on such a low budget in such a short time?, the official asks out of curiosity.
That's simple, Bogdan answers,I start digging from the British coast, my brother Roman starts digging from the French coast; the moment we meet, the work is over.
OK, says the official,but what if you don't meet?.
No worries then: you get two tunnels for the price of one
We should do a similar thing here. Instead of traversing the 6 levels from the beginning, we will traverse just 3 levels from each side, then join the resulting recordsets and hope some matching records will be found.
Traversing only 3 levels will be quite fast; and the resulting recordsets will be of moderate size so joining them will be easy.
As an extra, we will return the shortest path from one person to the other. To do this, we will need to record the friendship chain in an array. PostgreSQL does not offer an easy way to reverse an array, so in the first recordset, we will append the friends to the array, while in the second one we will prepend them. This way, we should just concatenate the resulting
Here's the query:
WITH RECURSIVE q1 (person, chain, lvl) AS ( SELECT 123456, ARRAY, 1 UNION ALL SELECT friend, chain || friend, lvl + 1 FROM ( SELECT q1.*, friend FROM q1 JOIN ( SELECT orestes AS me, pylades AS friend FROM friends UNION ALL SELECT pylades AS me, orestes AS friend FROM friends ) f ON person = me WHERE lvl <= 3 ) qo ), q2 (person, chain, lvl) AS ( SELECT 654321, ARRAY, 1 UNION ALL SELECT friend, friend || chain, lvl + 1 FROM ( SELECT q2.*, friend FROM q2 JOIN ( SELECT orestes AS me, pylades AS friend FROM friends UNION ALL SELECT pylades AS me, orestes AS friend FROM friends ) f ON person = me WHERE lvl <= 3 ) qo ) SELECT (q1.chain || q2.chain[2:q2.lvl])::TEXT AS chain FROM q1 JOIN q2 ON q2.person = q1.person ORDER BY q1.lvl + q2.lvl LIMIT 1
|1 row fetched in 0.0003s (0.5313s)|
Limit (cost=1138629880.05..1138629880.05 rows=1 width=72) CTE q1 -> Recursive Union (cost=0.00..71529.77 rows=2753901 width=40) -> Result (cost=0.00..0.01 rows=1 width=0) -> Nested Loop (cost=0.00..1645.17 rows=275390 width=40) Join Filter: (q1.person = "20100227_friends".friends.orestes) -> WorkTable Scan on q1 (cost=0.00..0.22 rows=3 width=40) Filter: (lvl <= 3) -> Append (cost=0.00..88.96 rows=30 width=8) -> Index Scan using pk_friends_op on friends (cost=0.00..37.55 rows=17 width=8) Index Cond: ("20100227_friends".friends.orestes = q1.person) -> Index Scan using ux_friends_po on friends (cost=0.00..51.40 rows=13 width=8) Index Cond: ("20100227_friends".friends.pylades = q1.person) CTE q2 -> Recursive Union (cost=0.00..71529.77 rows=2753901 width=40) -> Result (cost=0.00..0.01 rows=1 width=0) -> Nested Loop (cost=0.00..1645.17 rows=275390 width=40) Join Filter: (q2.person = "20100227_friends".friends.orestes) -> WorkTable Scan on q2 (cost=0.00..0.22 rows=3 width=40) Filter: (lvl <= 3) -> Append (cost=0.00..88.96 rows=30 width=8) -> Index Scan using pk_friends_op on friends (cost=0.00..37.55 rows=17 width=8) Index Cond: ("20100227_friends".friends.orestes = q2.person) -> Index Scan using ux_friends_po on friends (cost=0.00..51.40 rows=13 width=8) Index Cond: ("20100227_friends".friends.pylades = q2.person) -> Sort (cost=1138486820.51..1233286454.49 rows=37919853589 width=72) Sort Key: ((q1.lvl + q2.lvl)) -> Merge Join (cost=849904.33..948887552.57 rows=37919853589 width=72) Merge Cond: (q1.person = q2.person) -> Sort (cost=424952.16..431836.92 rows=2753901 width=40) Sort Key: q1.person -> CTE Scan on q1 (cost=0.00..55078.02 rows=2753901 width=40) -> Materialize (cost=424952.16..459375.93 rows=2753901 width=40) -> Sort (cost=424952.16..431836.92 rows=2753901 width=40) Sort Key: q2.person -> CTE Scan on q2 (cost=0.00..55078.02 rows=2753901 width=40)
Note that the anchor part can not be used more than once in a recursive expression. To work around that, we had to join it to a derived table (a
UNION ALL of two copies of the table with the columns swapped). However, PostgreSQL's optimizer was smart enough to push the join predicate into the derived table and distribute the queries so that each part uses a corresponding index efficiently. This helps to traverse the tree and build the recordsets from both ends.
Each of the recordsets has only about 8,000 records, so scanning and joining them is very fast.
The whole query takes just a little longer than 0.5 seconds.
Hope that helps.
I'm always glad to answer the questions regarding database queries.