Shuffling rows: PostgreSQL at EXPLAIN EXTENDED

Answering questions asked on the site.

Josh asks:

I am building a music application and need to create a playlist of arbitrary length from the tracks stored in the database.

This playlist should be shuffled, and a track can repeat only after at least 10 other tracks have been played.

Is it possible to do this with a single SQL query, or do I need to create a cursor?

This is in PostgreSQL 8.4

PostgreSQL 8.4 is a wise choice, since it introduces some new features that ease this task.

To do this, we just need to keep a running set holding the previous 10 tracks, so that we can filter them out of each new pick.

PostgreSQL 8.4 supports recursive CTEs, which let a query iterate over its own result set, and arrays, which make it easy to keep the set of the 10 latest tracks.
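
To see the two features working together before applying them to tracks, here is a minimal sketch on plain numbers. The CTE name counter, its columns n and seen, and the window size of 3 are made up for illustration. Each step adds 1 to the previous value and keeps only the 3 latest values in an array:

```sql
WITH    RECURSIVE
        counter (n, seen) AS
        (
        -- Base part: start at 1, with a one-element array
        SELECT  1, ARRAY[1]
        UNION ALL
        -- Recursive part: prepend the new value, keep at most 3 elements
        SELECT  n + 1, ((n + 1) || seen)[1:3]
        FROM    counter
        WHERE   n < 5
        )
SELECT  n, seen
FROM    counter;
-- n = 1..5, seen = {1}, {2,1}, {3,2,1}, {4,3,2}, {5,4,3}
```

The playlist query follows the same pattern, except that the next value is a random track rather than n + 1.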

Here's what we should do to build the playlist:

1. We make a recursive CTE that can generate as many records as we need, and use LIMIT in the outer query to cap their number.
2. The base part of the CTE is just one random record (fetched with ORDER BY RANDOM() LIMIT 1).
3. The base part also defines the queue: an array that holds the ids of the 10 latest records selected. It is initialized with the id of the random track just selected.
4. The recursive part of the CTE joins the previous record with the table, making sure that no record from the latest 10 is selected on this step. To do this, we negate the array operator <@ (contained by).
5. The recursive part prepends the newly selected record to the queue. The queue should be no more than 10 records long, so we apply the array slicing operator to it ([1:10]).
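
The filter described in the steps above can be tried on its own. This is a small sketch using literal values instead of the track table, so the ids and the two-element queue are just illustrative:

```sql
-- Keep only the ids that are NOT in the queue {2,4}
SELECT  s AS id
FROM    generate_series(1, 5) s
WHERE   NOT ARRAY[s] <@ ARRAY[2, 4];
-- returns 1, 3, 5
```

Wrapping the single id into a one-element array lets us use the array containment operator for a plain membership test.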

Let's create a sample table:

CREATE TABLE t_track
        (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(20) NOT NULL
        );

INSERT
INTO    t_track
SELECT  s, 'Track ' || s
FROM    generate_series(1, 1000) s;

ANALYZE t_track;

This table is quite simple: it just contains 1,000 tracks with generated names.

And here's the query:

WITH    RECURSIVE
        shuffle AS
        (
        SELECT  *
        FROM    (
                SELECT  id, name, ARRAY[id] AS queue
                FROM    t_track
                ORDER BY
                        RANDOM()
                LIMIT 1
                ) q
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  t.id, t.name, (t.id || s.queue)[1:10]
                FROM    shuffle s
                JOIN    t_track t
                ON      NOT ARRAY[t.id] <@ s.queue
                ORDER BY
                        RANDOM()
                LIMIT 1
                ) q
        )
SELECT  id, name, queue::VARCHAR
FROM    shuffle
LIMIT 30

id	name	queue
739	Track 739	{739}
811	Track 811	{811,739}
216	Track 216	{216,811,739}
192	Track 192	{192,216,811,739}
286	Track 286	{286,192,216,811,739}
287	Track 287	{287,286,192,216,811,739}
856	Track 856	{856,287,286,192,216,811,739}
371	Track 371	{371,856,287,286,192,216,811,739}
336	Track 336	{336,371,856,287,286,192,216,811,739}
558	Track 558	{558,336,371,856,287,286,192,216,811,739}
99	Track 99	{99,558,336,371,856,287,286,192,216,811}
462	Track 462	{462,99,558,336,371,856,287,286,192,216}
653	Track 653	{653,462,99,558,336,371,856,287,286,192}
682	Track 682	{682,653,462,99,558,336,371,856,287,286}
329	Track 329	{329,682,653,462,99,558,336,371,856,287}
365	Track 365	{365,329,682,653,462,99,558,336,371,856}
72	Track 72	{72,365,329,682,653,462,99,558,336,371}
841	Track 841	{841,72,365,329,682,653,462,99,558,336}
159	Track 159	{159,841,72,365,329,682,653,462,99,558}
521	Track 521	{521,159,841,72,365,329,682,653,462,99}
736	Track 736	{736,521,159,841,72,365,329,682,653,462}
759	Track 759	{759,736,521,159,841,72,365,329,682,653}
142	Track 142	{142,759,736,521,159,841,72,365,329,682}
607	Track 607	{607,142,759,736,521,159,841,72,365,329}
331	Track 331	{331,607,142,759,736,521,159,841,72,365}
957	Track 957	{957,331,607,142,759,736,521,159,841,72}
985	Track 985	{985,957,331,607,142,759,736,521,159,841}
702	Track 702	{702,985,957,331,607,142,759,736,521,159}
914	Track 914	{914,702,985,957,331,607,142,759,736,521}
569	Track 569	{569,914,702,985,957,331,607,142,759,736}
30 rows fetched in 0.0012s (0.1246s)

Limit  (cost=3444.86..3445.13 rows=11 width=94)
  CTE shuffle
    ->  Recursive Union  (cost=23.50..3444.86 rows=11 width=94)
          ->  Subquery Scan q  (cost=23.50..23.51 rows=1 width=94)
                ->  Limit  (cost=23.50..23.50 rows=1 width=13)
                      ->  Sort  (cost=23.50..26.00 rows=1000 width=13)
                            Sort Key: (random())
                            ->  Seq Scan on t_track  (cost=0.00..18.50 rows=1000 width=13)
          ->  Subquery Scan q  (cost=342.10..342.11 rows=1 width=94)
                ->  Limit  (cost=342.10..342.10 rows=1 width=45)
                      ->  Sort  (cost=342.10..367.08 rows=9990 width=45)
                            Sort Key: (random())
                            ->  Nested Loop  (cost=17.00..292.15 rows=9990 width=45)
                                  Join Filter: (NOT (ARRAY[t.id] <@ s.queue))
                                  ->  WorkTable Scan on shuffle s  (cost=0.00..0.20 rows=10 width=32)
                                  ->  Materialize  (cost=17.00..27.00 rows=1000 width=13)
                                        ->  Seq Scan on t_track t  (cost=0.00..16.00 rows=1000 width=13)
  ->  CTE Scan on shuffle  (cost=0.00..0.28 rows=11 width=94)

This query selects the first 30 records, but the LIMIT clause can be changed to select an arbitrary number of records (including more than 1,000), since we don't apply any limit inside the recursive part of the query.

Normally the queue would be hidden, but I left it in to illustrate what's going on. As you can see, the queue holds the ids of the last 10 records.
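
One way to convince yourself that the spacing rule holds is to number the rows and look for repeats that come too close together. The sketch below is not part of the original query. The pos column and the tmp_playlist table are hypothetical names introduced for illustration, and it assumes that your version accepts a WITH query inside CREATE TEMPORARY TABLE ... AS:

```sql
-- Same shuffle, with a position counter (pos) added for checking
CREATE TEMPORARY TABLE tmp_playlist AS
WITH    RECURSIVE
        shuffle AS
        (
        SELECT  *
        FROM    (
                SELECT  1 AS pos, id, name, ARRAY[id] AS queue
                FROM    t_track
                ORDER BY
                        RANDOM()
                LIMIT 1
                ) q
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  s.pos + 1, t.id, t.name, (t.id || s.queue)[1:10]
                FROM    shuffle s
                JOIN    t_track t
                ON      NOT ARRAY[t.id] <@ s.queue
                ORDER BY
                        RANDOM()
                LIMIT 1
                ) q
        )
SELECT  pos, id, name
FROM    shuffle
LIMIT 2000;

-- Count pairs where the same track reappears within the next 10 positions
SELECT  COUNT(*) AS violations
FROM    tmp_playlist a
JOIN    tmp_playlist b
ON      b.id = a.id
        AND b.pos > a.pos
        AND b.pos <= a.pos + 10;
```

If the queue works as described, the second query should report no violations, even though 2,000 rows out of 1,000 tracks guarantees that some tracks repeat.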

The query runs in about 120 ms, which is quite fast, but it could be improved further using the approaches described in PostgreSQL 8.4: sampling random rows. However, this would make the query much harder to read, and ORDER BY RANDOM() is just fine to demonstrate the principle.

Hope that helps.

I'm always glad to answer questions about database queries.

Written by Quassnoi

October 6th, 2009 at 11:00 pm

Posted in PostgreSQL