EXPLAIN EXTENDED

How to create fast database queries

Selecting random rows

with 8 comments

Usual way to select random rows from a MySQL table looks like this:

SELECT  *
FROM    t_random
ORDER BY
        RAND(20090301)
LIMIT 10

This creates a fake RAND() column along with the query columns, sorts on it and returns bottom 10 results.

EXPLAIN EXTENDED on this query returns quite expected Using temporary; Using filesort

This plan will use Θ(N×log(N)) time on sorting alone. If temporary will hit memory limits, the time will increase even more rapidly.

Here's the result of the query above:

rnd_id rnd_value
527589 Random value 527589
68841 Random value 68841
209868 Random value 209868
278268 Random value 278268
669199 Random value 669199
305024 Random value 305024
391294 Random value 391294
892500 Random value 892500
946372 Random value 946372
443867 Random value 443867
10 rows fetched in 0,0271s (13,5870s)

More than 13 seconds, and it's only a million rows. And financial databases may have billions.

But do we really need sorting anyway?

Say, we need to select 10 rows out of 1 000 000. If we just iterate the rows, which is the probability of each row to be selected on its turn?

As no rows differ from each other for our purposes, each remaining row may be selected with equal probabilty. This means, that if on a certain turn we have C rows left uninspected with L rows yet to select, then the probability of the current row to be selected will be just L / C. That simple.

Unfortunately, MySQL doesn't provide us with row numbers, that's why we'll need to perform certain tricks with variables to work around this:

SELECT  *
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    t_random
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    t_random r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i

In the first subquery (called vars) we initialize the variables, including COUNT(*). We'll need STRAIGHT_JOIN here to ensure that the first query will be executed first.

Expressions in WHERE conditions, like @cnt := @cnt - 1, are evaluated on each row. Expressions in SELECT clause, like @lim := @lim - 1, are evaluated only when the row is selected (that's is, after all WHERE conditions were evaluated and found to be true). That's exactly what we want: we need to decrement @cnt each time a row is inspected, and decrement @lim each time a row is selected.

Running this query, we get:

rnd_id rnd_value
68841 Random value 68841
209868 Random value 209868
278268 Random value 278268
305024 Random value 305024
527589 Random value 527589
669199 Random value 669199
881418 Random value 881418
892500 Random value 892500
946372 Random value 946372
984592 Random value 984592
10 rows fetched in 0,0165s (0,5680s)

Only 0,56 seconds. Much much nicer.

This is, of course, not very honest result, as I deliberately created table t_random using MyISAM storage. You may have noted that the COUNT(*) is used in the first subquery to initialize @cnt. It's instant on a MyISAM table (as it's stored in the table's metadata), but takes some time on InnoDB.

But let's see the results for InnoDB:

SELECT  *
FROM    t_random_innodb
ORDER BY
        RAND(20090301)
LIMIT 10
10 rows fetched in 0,0104s (13,1379s)
SELECT  rnd_id, rnd_value
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    t_random_innodb
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    t_random_innodb r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i
10 rows fetched in 0,0011s (1,7266s)

Also nice.

There is, of course, more place for optimization:

  1. It's possible to cache or estimate COUNT(*) for an InnoDB table and use it to initialize @cnt. This of course may cut some latest rows or even return fewer rows than requested, but it's OK if you don't need high accuracy.
  2. It's possible to select MIN() and MAX() id's from the table, do the things above in a loop (without actial table scan), store results into a temporary table and then join only 10 id's you need (using correlated subqueries with >= id LIMIT 1, if your id's are sparse). This also may return fewer rows than requested.

But if you just need to rewrite your working but inefficient query without much effort, that's exactly what you need.

Written by Quassnoi

March 1st, 2009 at 9:00 pm

Posted in MySQL

8 Responses to 'Selecting random rows'

Subscribe to comments with RSS

  1. Hi there! I Know this post is a little old, but I just found it. I have a question, why do you use 20090301 in the RAND call? Is the MAX something? I’m kind of new in this, and if you don’t realize, my english sucks.

    Cheers!

    Sebastian

    3 Jan 13 at 18:34

  2. @Sebastian: it’s the date of the post, used as a random seed. If you call RAND() with a parameter, it will persist between the queries. I used it so that you can download the queries and reproduce them on your instance. Normally, you would just call RAND().

    Quassnoi

    3 Jan 13 at 18:58

  3. Oh, thanks!! I will try your algorithm on my code!

    Sebastian

    3 Jan 13 at 19:01

  4. Great work!

    Jeon

    9 Apr 13 at 10:02

  5. Hello,
    I’m trying to get some random values only if they are not present in another table (the final goal is to insert the randoms value in this table).
    Then I’ve included in the WHERE clause after RAND() < @lim / @cnt, something like AND t_rand.col NOT IN (SELECT t_2.col FROM t_2).
    My table t_rand contains 10 million of INT. When t_2 is more than 1,5 or 2 million of records, I don't get the wanted number of rows while generating some random values.
    I guess that I need somewhere to prevent @lim to be decremented ?
    I tried to play with CASE, but without success.
    Any idea ?

    Thanks a lot !

    PS:
    Using the classical statement SELECT col FROM t_rand WHERE col NOT IN (SELECT t_2.col FROM t_2) ORDER BY RAND() LIMIT X everything looks good but too much long; moreover I need to play with millions of data so I can't use this.

    Seb2

    5 Aug 13 at 03:38

  6. Hello
    I hope you see my comment and answer to me!
    I am using your method for random select from my table but every time, the response is the same and nothing changes!
    for example : when I want to select 5 rows from 1000 rows, every time , it returns rows number 1,3,5,6,7
    who can I fix it?

    Neo

    26 Oct 14 at 16:21

  7. @Neo: if you are calling RAND(20090301) like in the sample, you will get the same results every time.

    I made the sample so that the readers can reproduce my results.

    On production environment, call RAND() without parameters so that results are truly random.

    Quassnoi

    27 Oct 14 at 13:00

  8. Thanks alot!
    now I have another problem with this query! I want to add whereClause to query something like this :
    select id,title,… from question
    where level=’Easy’
    order by rand() limit 3;
    I changed your example to this one :
    SELECT id, title,…
    FROM (
    SELECT @cnt := COUNT(*) + 1,@lim:=2
    FROM question
    ) vars
    STRAIGHT_JOIN (SELECT r.*,@lim:=@lim-1
    FROM question r
    WHERE(@cnt:=@cnt-1) AND RAND(87724262) < @lim / @cnt ) i;
    I want to know where I can add my whereClause!
    I'm new to mysql!
    thanks for your help

    Neo

    28 Oct 14 at 14:43

Leave a Reply