## Selecting random rows

The usual way to select random rows from a `MySQL` table looks like this:

```sql
SELECT  *
FROM    t_random
ORDER BY
        RAND(20090301)
LIMIT 10
```

This creates a fake `RAND()` column along with the query columns, sorts on it, and returns the top **10** results.

`EXPLAIN EXTENDED` on this query returns the quite expected `Using temporary; Using filesort`.

This plan will spend `Θ(N × log(N))` time on sorting alone. If the `temporary` table hits the memory limits, the time will increase even more rapidly.

Here's the result of the query above:

rnd_id | rnd_value
---|---
527589 | Random value 527589
68841 | Random value 68841
209868 | Random value 209868
278268 | Random value 278268
669199 | Random value 669199
305024 | Random value 305024
391294 | Random value 391294
892500 | Random value 892500
946372 | Random value 946372
443867 | Random value 443867

10 rows fetched in 0,0271s (13,5870s)

More than **13** seconds, and it's only a million rows. And financial databases may have billions.

But do we really need sorting anyway?

Say, we need to select **10** rows out of **1 000 000**. If we just iterate over the rows, what is the probability of each row being selected on its turn?

Since no rows differ from each other for our purposes, each remaining row may be selected with equal probability. This means that if on a certain turn we have `C` rows left uninspected and `L` rows yet to select, the probability of the current row being selected is just `L / C`. That simple.
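This one-pass scheme is known as selection sampling. Here's a minimal Python sketch of the idea (the function name and the row count are illustrative, not part of the article's queries); `cnt` and `lim` play the same roles as the query's counters:

```python
import random

def select_random_rows(n_rows, n_wanted, seed=None):
    """One-pass selection sampling: no sorting needed.

    On each turn there are `cnt` rows left uninspected and `lim` rows
    yet to select; take the current row with probability lim / cnt.
    When lim reaches cnt, the probability becomes 1, so exactly
    n_wanted rows always come back.
    """
    rng = random.Random(seed)
    selected = []
    lim = n_wanted
    for row in range(n_rows):
        cnt = n_rows - row                # rows left, current one included
        if rng.random() < lim / cnt:      # select with probability L / C
            selected.append(row)
            lim -= 1
            if lim == 0:                  # nothing left to pick
                break
    return selected
```

Run against a million "rows", this returns exactly 10 row numbers in a single scan, which is what the variable trick below reproduces in SQL.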

Unfortunately, `MySQL` doesn't provide us with row numbers, so we'll need to perform certain tricks with variables to work around this:

```sql
SELECT  *
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    t_random
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    t_random r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i
```

In the first subquery (called `vars`) we initialize the variables, including `COUNT(*)`. We need `STRAIGHT_JOIN` here to ensure that the first subquery is executed first.

Expressions in `WHERE` conditions, like `@cnt := @cnt - 1`, are evaluated on each row. Expressions in the `SELECT` clause, like `@lim := @lim - 1`, are evaluated only when the row is selected (that is, *after* all `WHERE` conditions have been evaluated and found to be true). That's exactly what we want: we need to decrement `@cnt` each time a row is **inspected**, and decrement `@lim` each time a row is **selected**.

Running this query, we get:

rnd_id | rnd_value
---|---
68841 | Random value 68841
209868 | Random value 209868
278268 | Random value 278268
305024 | Random value 305024
527589 | Random value 527589
669199 | Random value 669199
881418 | Random value 881418
892500 | Random value 892500
946372 | Random value 946372
984592 | Random value 984592

10 rows fetched in 0,0165s (0,5680s)

Only **0,56** seconds. Much, much nicer.

This is, of course, not a completely honest result, as I deliberately created the table `t_random` using the `MyISAM` storage engine. You may have noticed that `COUNT(*)` is used in the first subquery to initialize `@cnt`. It's instant on a `MyISAM` table (the row count is stored in the table's metadata), but takes some time on `InnoDB`.

But let's see the results for `InnoDB`:

```sql
SELECT  *
FROM    t_random_innodb
ORDER BY
        RAND(20090301)
LIMIT 10
```

10 rows fetched in 0,0104s (13,1379s)

```sql
SELECT  rnd_id, rnd_value
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    t_random_innodb
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    t_random_innodb r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i
```

10 rows fetched in 0,0011s (1,7266s)

Also nice.

There is, of course, more room for optimization:

- It's possible to cache or estimate `COUNT(*)` for an `InnoDB` table and use it to initialize `@cnt`. This may cut off some of the last rows or even return fewer rows than requested, but it's OK if you don't need high accuracy.
- It's possible to select the `MIN()` and `MAX()` `id`'s from the table, do the things above in a **loop** (without an actual table scan), store the results into a temporary table, and then join only the **10** id's you need (using correlated subqueries with `>= id LIMIT 1`, if your `id`'s are sparse). This also may return fewer rows than requested.
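The second idea can be sketched outside of SQL as well. This is an illustrative Python model (the function and variable names are mine, not from the article): assuming a sorted list of sparse `id`'s, each random probe between `MIN(id)` and `MAX(id)` is snapped to the next existing `id`, which is what a correlated `>= id LIMIT 1` subquery would return, and collisions are dropped, which is exactly why fewer rows may come back:

```python
import bisect
import random

def sample_sparse_ids(ids, n_wanted, seed=None):
    """Probe random points in [MIN(id), MAX(id)] and snap each one to
    the next existing id, mimicking `WHERE id >= ? ORDER BY id LIMIT 1`.
    Duplicate hits collapse, so fewer than n_wanted ids may be returned.
    """
    rng = random.Random(seed)
    lo, hi = ids[0], ids[-1]              # MIN(id) and MAX(id)
    picked = set()
    for _ in range(n_wanted):
        probe = rng.randint(lo, hi)
        # first existing id >= probe; probe <= hi guarantees a hit
        picked.add(ids[bisect.bisect_left(ids, probe)])
    return sorted(picked)
```

Note that an id sitting after a large gap is more likely to be hit than one in a dense run, so this variant trades uniformity for speed.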

But if you just need to rewrite a working but inefficient query without much effort, the technique above is exactly what you need.

Hi there! I know this post is a little old, but I just found it. I have a question: why do you use 20090301 in the RAND call? Is it the MAX of something? I'm kind of new to this, and in case you haven't noticed, my English sucks.

Cheers!

Sebastian, 3 Jan 13 at 18:34

@Sebastian: it’s the date of the post, used as a random seed. If you call RAND() with a seed, its results persist between queries. I used it so that you can download the queries and reproduce them on your instance. Normally, you would just call RAND().

Quassnoi, 3 Jan 13 at 18:58

Oh, thanks!! I will try your algorithm in my code!

Sebastian, 3 Jan 13 at 19:01

Great work!

Jeon, 9 Apr 13 at 10:02

Hello,

I'm trying to get random values only if they are not present in another table (the final goal is to insert the random values into that table).

So I've added to the WHERE clause, after RAND() < @lim / @cnt, something like AND t_rand.col NOT IN (SELECT t_2.col FROM t_2).

My table t_rand contains 10 million INTs. When t_2 has more than 1.5 or 2 million records, I don't get the wanted number of rows when generating random values.

I guess I need somewhere to prevent @lim from being decremented?

I tried to play with CASE, but without success.

Any idea?

Thanks a lot!

PS:

Using the classical statement SELECT col FROM t_rand WHERE col NOT IN (SELECT t_2.col FROM t_2) ORDER BY RAND() LIMIT X, everything looks good but it is far too slow; moreover, I need to play with millions of rows, so I can't use that.

Seb, 25 Aug 13 at 03:38