EXPLAIN EXTENDED

How to create fast database queries

Six degrees of separation

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Kathy asks:

I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.

If your site grows popular, most probably, they are not. But we better check.

On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.

The friendship relationship is best stored in a many-to-many link table with a PRIMARY KEY on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a PRIMARY KEY won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.

Let's create a sample table:

Table creation details

This table stores records for 1,000,000 people having 20 friends each in average. The first column is named orestes and the second one pylades.

With new PostgreSQL 8.4 it is easy to write a recursive query that would traverse the relationship graph up to the given level and stop on the first match.

However, each recursion step requires a join, and as the number of records in the input recordset for the recusion grows with level, the joins become less and less efficient. The number of records grows exponentially, and on level 6 there will be about 20 ^ 6 = 64,000,000 records on input. This is just too much for a join with a 20,000,000 records table.

As the chain length increases, the tree diverges, the number of the records grows and it becomes more and more costly to join them with the table.

To work around this, we should use Bogdan the tunnel builder's algorithm.

British and French governments submit a tender to build a tunnel under the Channel. Many companies apply, all demanding years of time and billions pounds of money, so their offers are refused.

One day, Bogdan drops in and offers his services.

How much money would you demand for your work?, the official asks. Me and my brother Roman are good eaters, so we will need to buy a decent meal every day. 50 pounds a day will be OK.

That's pretty cheap; and how much time will you need? Me and my brother Roman are fast diggers; one mile a day I think we will dig.

Oh, that's pretty fast! But how will you be able to work on such a low budget in such a short time?, the official asks out of curiosity.

That's simple, Bogdan answers, I start digging from the British coast, my brother Roman starts digging from the French coast; the moment we meet, the work is over.

OK, says the official, but what if you don't meet?.

No worries then: you get two tunnels for the price of one

We should do a similar thing here. Instead of traversing the 6 levels from the beginning, we will traverse just 3 levels from each side, then join the resulting recordsets and hope some matching records will be found.

Traversing only 3 levels will be quite fast; and the resulting recordsets will be of moderate size so joining them will be easy.

As an extra, we will return the shortest path from one person to the other. To do this, we will need to record the friendship chain in an array. PostgreSQL does not offer an easy way to reverse an array, so in the first recordset, we will append the friends to the array, while in the second one we will prepend them. This way, we should just concatenate the resulting tunnelsarrays.

Here's the query:

WITH    RECURSIVE
        q1 (person, chain, lvl) AS
        (
        SELECT  123456, ARRAY[123456], 1
        UNION ALL
        SELECT  friend, chain || friend, lvl + 1
        FROM    (
                SELECT  q1.*,
                        friend
                FROM    q1
                JOIN    (
                        SELECT  orestes AS me, pylades AS friend
                        FROM    friends
                        UNION ALL
                        SELECT  pylades AS me, orestes AS friend
                        FROM    friends
                        ) f
                ON      person = me
                WHERE   lvl <= 3
                ) qo
        ),
        q2 (person, chain, lvl) AS
        (
        SELECT  654321, ARRAY[654321], 1
        UNION ALL
        SELECT  friend, friend || chain, lvl + 1
        FROM    (
                SELECT  q2.*,
                        friend
                FROM    q2
                JOIN    (
                        SELECT  orestes AS me, pylades AS friend
                        FROM    friends
                        UNION ALL
                        SELECT  pylades AS me, orestes AS friend
                        FROM    friends
                        ) f
                ON      person = me
                WHERE   lvl <= 3
                ) qo
        )
SELECT  (q1.chain || q2.chain[2:q2.lvl])::TEXT AS chain
FROM    q1
JOIN    q2
ON      q2.person = q1.person
ORDER BY
        q1.lvl + q2.lvl
LIMIT 1

View query details

Note that the anchor part can not be used more than once in a recursive expression. To work around that, we had to join it to a derived table (a UNION ALL of two copies of the table with the columns swapped). However, PostgreSQL's optimizer was smart enough to push the join predicate into the derived table and distribute the queries so that each part uses a corresponding index efficiently. This helps to traverse the tree and build the recordsets from both ends.

Each of the recordsets has only about 8,000 records, so scanning and joining them is very fast.

The whole query takes just a little longer than 0.5 seconds.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 27th, 2010 at 11:00 pm

Posted in PostgreSQL

Leave a Reply