Selecting random rows
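All examples below run against a table called t_random with about a million rows. Its definition isn't shown in this post, so the following is only a guess based on the column names in the result sets and the MyISAM remark further down:

-- Hypothetical definition of t_random; the real layout may differ.
CREATE TABLE t_random (
        rnd_id INT NOT NULL PRIMARY KEY,
        rnd_value VARCHAR(50) NOT NULL
) ENGINE=MyISAM;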
The usual way to select random rows from a MySQL table looks like this:
SELECT * FROM t_random ORDER BY RAND(20090301) LIMIT 10
This creates a fake RAND() column along with the query columns, sorts on it, and returns the first 10 rows of the sorted set.
EXPLAIN EXTENDED on this query returns the quite expected Using temporary; Using filesort. This plan will spend Θ(N × log(N)) time on sorting alone. If the temporary table hits the memory limits, the time will grow even more rapidly.
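For reference, that plan comes from simply prefixing the query with EXPLAIN EXTENDED (the syntax of the MySQL versions this post targets; later versions use plain EXPLAIN, and the exact output differs between versions):

EXPLAIN EXTENDED
SELECT  *
FROM    t_random
ORDER BY
        RAND(20090301)
LIMIT 10;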
Here's the result of the original query:
| rnd_id | rnd_value |
|---|---|
| 527589 | Random value 527589 |
| 68841 | Random value 68841 |
| 209868 | Random value 209868 |
| 278268 | Random value 278268 |
| 669199 | Random value 669199 |
| 305024 | Random value 305024 |
| 391294 | Random value 391294 |
| 892500 | Random value 892500 |
| 946372 | Random value 946372 |
| 443867 | Random value 443867 |

10 rows fetched in 0,0271s (13,5870s)
More than 13 seconds, and it's only a million rows. And financial databases may have billions.
But do we really need sorting anyway?
Say we need to select 10 rows out of 1 000 000. If we just iterate over the rows, what is the probability of each row being selected on its turn?

Since no row differs from any other for our purposes, each remaining row may be selected with equal probability. This means that if on a certain turn we have C rows left uninspected and L rows yet to select, the probability of the current row being selected is just L / C. That simple. For instance, the very first row is selected with probability 10 / 1 000 000; if it was selected, the second row gets 9 / 999 999, otherwise it gets 10 / 999 999.
Unfortunately, MySQL doesn't provide us with row numbers, so we'll need to perform certain tricks with variables to work around this:
SELECT  *
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    t_random
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    t_random r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i
In the first subquery (called vars) we initialize the variables, including the COUNT(*). We need STRAIGHT_JOIN here to ensure that the first subquery is executed first.

Expressions in the WHERE condition, like @cnt := @cnt - 1, are evaluated on each row. Expressions in the SELECT clause, like @lim := @lim - 1, are evaluated only when the row is selected (that is, after all WHERE conditions have been evaluated and found to be true). That's exactly what we want: we need to decrement @cnt each time a row is inspected, and decrement @lim each time a row is selected.
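Here is a minimal illustration of that difference, reusing t_random; it is not part of the original query, just a sanity check. It counts how many rows pass through the WHERE clause versus how many make it into the result:

SET @inspected := 0, @selected := 0;

SELECT  rnd_id,
        @selected := @selected + 1 AS selected_so_far  -- evaluated only for rows that pass the WHERE
FROM    t_random
WHERE   (@inspected := @inspected + 1)                 -- evaluated for every row inspected
        AND rnd_id % 2 = 0
LIMIT 5;

-- @selected ends up at 5; @inspected shows how many rows were scanned to find them.
SELECT  @inspected AS rows_inspected, @selected AS rows_selected;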
Running the full query above, we get:
| rnd_id | rnd_value |
|---|---|
| 68841 | Random value 68841 |
| 209868 | Random value 209868 |
| 278268 | Random value 278268 |
| 305024 | Random value 305024 |
| 527589 | Random value 527589 |
| 669199 | Random value 669199 |
| 881418 | Random value 881418 |
| 892500 | Random value 892500 |
| 946372 | Random value 946372 |
| 984592 | Random value 984592 |

10 rows fetched in 0,0165s (0,5680s)
Only 0,56 seconds. Much much nicer.
This is, of course, not a very honest result, as I deliberately created the table t_random using the MyISAM storage engine. You may have noticed that COUNT(*) is used in the first subquery to initialize @cnt. It's instant on a MyISAM table (the row count is stored in the table's metadata), but takes some time on InnoDB.
But let's see the results for InnoDB:
SELECT * FROM t_random_innodb ORDER BY RAND(20090301) LIMIT 10
10 rows fetched in 0,0104s (13,1379s)

SELECT  rnd_id, rnd_value
FROM    (
        SELECT  @cnt := COUNT(*) + 1,
                @lim := 10
        FROM    t_random_innodb
        ) vars
STRAIGHT_JOIN
        (
        SELECT  r.*,
                @lim := @lim - 1
        FROM    t_random_innodb r
        WHERE   (@cnt := @cnt - 1)
                AND RAND(20090301) < @lim / @cnt
        ) i

10 rows fetched in 0,0011s (1,7266s)
Also nice.
There is, of course, more room for optimization:

- It's possible to cache or estimate COUNT(*) for an InnoDB table and use it to initialize @cnt. This of course may cut off some of the last rows or even return fewer rows than requested, but it's OK if you don't need high accuracy (see the first sketch after this list).
- It's possible to select the MIN() and MAX() id's from the table, do the things above in a loop (without an actual table scan), store the results into a temporary table and then join only the 10 id's you need (using correlated subqueries with >= id LIMIT 1, if your id's are sparse). This also may return fewer rows than requested (see the second sketch after this list).
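To make the first point concrete, a rough row count for an InnoDB table can be read from the table statistics instead of running COUNT(*). This is a hedged example rather than code from the post above, and TABLE_ROWS is only an estimate:

-- Approximate row count for an InnoDB table; not exact, but cheap.
SELECT  TABLE_ROWS
FROM    information_schema.TABLES
WHERE   TABLE_SCHEMA = DATABASE()
        AND TABLE_NAME = 't_random_innodb';

And here is a minimal sketch of the second idea, assuming rnd_id is the indexed primary key; it illustrates the approach rather than reproducing the exact loop-and-temporary-table scheme. Duplicates are collapsed with DISTINCT, so fewer than 10 rows may come back, and ids that follow large gaps are more likely to be picked:

-- For each of ten random points in [MIN(rnd_id), MAX(rnd_id)],
-- pick the first existing row at or after the point (an index seek, no full scan).
SELECT  DISTINCT
        (
        SELECT  rnd_id
        FROM    t_random
        WHERE   rnd_id >= bounds.mn + FLOOR(RAND() * (bounds.mx - bounds.mn + 1))
        ORDER BY
                rnd_id
        LIMIT 1
        ) AS rnd_id
FROM    (
        SELECT  MIN(rnd_id) AS mn,
                MAX(rnd_id) AS mx
        FROM    t_random
        ) bounds
CROSS JOIN
        (
        SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL
        SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10
        ) ten;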
But if you just need to rewrite your working but inefficient query without much effort, the variable-based query above is exactly what you need.
Hi there! I Know this post is a little old, but I just found it. I have a question, why do you use 20090301 in the RAND call? Is the MAX something? I’m kind of new in this, and if you don’t realize, my english sucks.
Cheers!
Sebastian
3 Jan 13 at 18:34
@Sebastian: it’s the date of the post, used as a random seed. If you call RAND() with a parameter, it will persist between the queries. I used it so that you can download the queries and reproduce them on your instance. Normally, you would just call RAND().
Quassnoi
3 Jan 13 at 18:58
Oh, thanks!! I will try your algorithm on my code!
Sebastian
3 Jan 13 at 19:01
Great work!
Jeon
9 Apr 13 at 10:02
Hello,
I’m trying to get some random values only if they are not present in another table (the final goal is to insert the randoms value in this table).
Then I’ve included in the WHERE clause after RAND() < @lim / @cnt, something like AND t_rand.col NOT IN (SELECT t_2.col FROM t_2).
My table t_rand contains 10 million of INT. When t_2 is more than 1,5 or 2 million of records, I don't get the wanted number of rows while generating some random values.
I guess that I need somewhere to prevent @lim to be decremented ?
I tried to play with CASE, but without success.
Any idea ?
Thanks a lot !
PS:
Using the classical statement SELECT col FROM t_rand WHERE col NOT IN (SELECT t_2.col FROM t_2) ORDER BY RAND() LIMIT X everything looks good but too much long; moreover I need to play with millions of data so I can't use this.
Seb2
5 Aug 13 at 03:38
Hello
I hope you see my comment and answer to me!
I am using your method for random select from my table but every time, the response is the same and nothing changes!
for example: when I want to select 5 rows from 1000 rows, every time it returns rows number 1, 3, 5, 6, 7.
How can I fix it?
Neo
26 Oct 14 at 16:21
@Neo: if you are calling RAND(20090301) like in the sample, you will get the same results every time.
I made the sample so that the readers can reproduce my results.
On production environment, call RAND() without parameters so that results are truly random.
Quassnoi
27 Oct 14 at 13:00
Thanks alot!
now I have another problem with this query! I want to add whereClause to query something like this :
select id,title,… from question
where level=’Easy’
order by rand() limit 3;
I changed your example to this one :
SELECT id, title,…
FROM (
SELECT @cnt := COUNT(*) + 1,@lim:=2
FROM question
) vars
STRAIGHT_JOIN (SELECT r.*,@lim:=@lim-1
FROM question r
WHERE(@cnt:=@cnt-1) AND RAND(87724262) < @lim / @cnt ) i;
I want to know where I can add my whereClause!
I'm new to mysql!
thanks for your help
Neo
28 Oct 14 at 14:43
select floor(RAND() * (SELECT MAX(key) FROM table)) from table limit 10
Jose Javier Fernandez Mendoza
5 Nov 14 at 15:22
select * from tabla where key in (select floor(RAND() * (SELECT MAX(key) FROM table)) from table limit 10)
Jose Javier Fernandez Mendoza
5 Nov 14 at 15:24
@Jose: you know you can’t use LIMIT like that in MySQL, right? And even if you could, your query would miss the gaps in sequential ids, if any?
Quassnoi
5 Nov 14 at 16:16
The algorithm still requires a full table scan. (Plus a second scan for COUNT(*) if InnoDB.)
FLUSH STATUS;
SELECT …;
SHOW SESSION STATUS LIKE 'Handler%';
demonstrates it, even for small tables.
Rick James
7 Aug 15 at 00:48
Hello, thank you for your code. I have tried it and it did work. I would like to know how to add a LEFT JOIN to another table to it.
Thank you
Davcaddo
30 Oct 15 at 15:18
I have made a study of how to fetch a small number of random rows without doing a full table scan. Here is a compilation of the few techniques that work efficiently, even on large tables: http://mysql.rjweb.org/doc.php/random
Rick James
30 Oct 15 at 18:52