Selecting random rows
Usual way to select random rows from a MySQL table looks like this:
SELECT *
FROM t_random
ORDER BY
RAND(20090301)
LIMIT 10
This creates a fake RAND() column along with the query columns, sorts on it and returns bottom 10 results.
EXPLAIN EXTENDED on this query returns quite expected Using temporary; Using filesort
This plan will use Θ(N×log(N)) time on sorting alone. If temporary will hit memory limits, the time will increase even more rapidly.
Here’s the result of the query above:
| rnd_id | rnd_value |
|---|---|
| 527589 | Random value 527589 |
| 68841 | Random value 68841 |
| 209868 | Random value 209868 |
| 278268 | Random value 278268 |
| 669199 | Random value 669199 |
| 305024 | Random value 305024 |
| 391294 | Random value 391294 |
| 892500 | Random value 892500 |
| 946372 | Random value 946372 |
| 443867 | Random value 443867 |
| 10 rows fetched in 0,0271s (13,5870s) | |
More than 13 seconds, and it’s only a million rows. And financial databases may have billions.
But do we really need sorting anyway?
Say, we need to select 10 rows out of 1 000 000. If we just iterate the rows, which is the probability of each row to be selected on its turn?
As no rows differ from each other for our purposes, each remaining row may be selected with equal probabilty. This means, that if on a certain turn we have C rows left uninspected with L rows yet to select, then the probability of the current row to be selected will be just L / C. That simple.
Unfortunately, MySQL doesn’t provide us with row numbers, that’s why we’ll need to perform certain tricks with variables to work around this:
SELECT *
FROM (
SELECT @cnt := COUNT(*) + 1,
@lim := 10
FROM t_random
) vars
STRAIGHT_JOIN
(
SELECT r.*,
@lim := @lim - 1
FROM t_random r
WHERE (@cnt := @cnt - 1)
AND RAND(20090301) < @lim / @cnt
) i
In the first subquery (called vars) we initialize the variables, including COUNT(*). We’ll need STRAIGHT_JOIN here to ensure that the first query will be executed first.
Expressions in WHERE conditions, like @cnt := @cnt - 1, are evaluated on each row. Expressions in SELECT clause, like @lim := @lim - 1, are evaluated only when the row is selected (that’s is, after all WHERE conditions were evaluated and found to be true). That’s exactly what we want: we need to decrement @cnt each time a row is inspected, and decrement @lim each time a row is selected.
Running this query, we get:
| rnd_id | rnd_value |
|---|---|
| 68841 | Random value 68841 |
| 209868 | Random value 209868 |
| 278268 | Random value 278268 |
| 305024 | Random value 305024 |
| 527589 | Random value 527589 |
| 669199 | Random value 669199 |
| 881418 | Random value 881418 |
| 892500 | Random value 892500 |
| 946372 | Random value 946372 |
| 984592 | Random value 984592 |
| 10 rows fetched in 0,0165s (0,5680s) | |
Only 0,56 seconds. Much much nicer.
This is, of course, not very honest result, as I deliberately created table t_random using MyISAM storage. You may have noted that the COUNT(*) is used in the first subquery to initialize @cnt. It’s instant on a MyISAM table (as it’s stored in the table’s metadata), but takes some time on InnoDB.
But let’s see the results for InnoDB:
SELECT *
FROM t_random_innodb
ORDER BY
RAND(20090301)
LIMIT 10
| 10 rows fetched in 0,0104s (13,1379s) |
SELECT rnd_id, rnd_value
FROM (
SELECT @cnt := COUNT(*) + 1,
@lim := 10
FROM t_random_innodb
) vars
STRAIGHT_JOIN
(
SELECT r.*,
@lim := @lim - 1
FROM t_random_innodb r
WHERE (@cnt := @cnt - 1)
AND RAND(20090301) < @lim / @cnt
) i
| 10 rows fetched in 0,0011s (1,7266s) |
Also nice.
There is, of course, more place for optimization:
- It’s possible to cache or estimate
COUNT(*)for anInnoDBtable and use it to initialize@cnt. This of course may cut some latest rows or even return fewer rows than requested, but it’s OK if you don’t need high accuracy. - It’s possible to select
MIN()andMAX()id‘s from the table, do the things above in a loop (without actial table scan), store results into a temporary table and then join only 10 id’s you need (using correlated subqueries with>= id LIMIT 1, if yourid‘s are sparse). This also may return fewer rows than requested.
But if you just need to rewrite your working but inefficient query without much effort, that’s exactly what you need.
Subscribe in a reader
Hi there! I Know this post is a little old, but I just found it. I have a question, why do you use 20090301 in the RAND call? Is the MAX something? I’m kind of new in this, and if you don’t realize, my english sucks.
Cheers!
Sebastian
3 Jan 13 at 18:34
@Sebastian: it’s the date of the post, used as a random seed. If you call RAND() with a parameter, it will persist between the queries. I used it so that you can download the queries and reproduce them on your instance. Normally, you would just call RAND().
Quassnoi
3 Jan 13 at 18:58
Oh, thanks!! I will try your algorithm on my code!
Sebastian
3 Jan 13 at 19:01
Great work!
Jeon
9 Apr 13 at 10:02