Row sampling
Sometimes we need to get a sample row from a table satisfying a certain condition. Like, get a first row for each month.
MS SQL
and Oracle
supply analytical function ROW_NUMBER()
for this purpose.
Let's create a simple table to illustrate our needs and see how do we query it.
CREATE TABLE t_limiter ( id int(10) unsigned NOT NULL, grouper int(10) unsigned NOT NULL, value varchar(45) NOT NULL, PRIMARY KEY (id), KEY ix_limiter_grouper_id (grouper,id) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT * FROM ( SELECT l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn FROM t_limiter l ) lo WHERE rn = 1
MySQL
of course lacks this function, but it can be easily emulated using:
SELECT * FROM ( SELECT grouper, MIN(id) AS mid FROM t_limiter GROUP BY grouper ) dl, t_limiter l WHERE l.grouper = dl.grouper AND l.id = mid
That's fine, we get exactly one row for each grouper
. But what is we need 15 rows for each month?
With analytical functions, it's still simple:
SELECT * FROM ( SELECT l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn FROM t_limiter l ) lo WHERE rn <= 15 [/sourcecode] In <code>MySQL</code> we can do, umm… Grouping won't help us in this case, as it always returns <strong>1</strong> row for each <code>grouper</code>, and we need <strong>15</strong> rows for a join. Same with correlated subqueries: they return <strong>1</strong> row. In this case, we can emulate analytical functions using tricks with session variables. Let's populate the table and see how the query works. We will create <strong>999,999</strong> records with <code>grouper = id % 32</code>, which will give us <code>grouper</code>'s from <strong>0</strong> to <strong>31</strong>, and <strong>1</strong> more record with <code>grouper = 32</code>. This final record will have <code>id</code> of <strong>1,000,000</strong>. <a href="#" onclick="xcollapse('X10374');return false;"><strong>Table population details</strong></a> <br /> <div id="X10374" style="display: none; "> DELIMITER $$ CREATE PROCEDURE prc_fill_limiter(_cnt INT, _modulus INT) BEGIN DECLARE i INT; SET i = 1; WHILE i < _cnt DO INSERT INTO t_limiter (id, grouper, value) VALUES (i, i % _modulus, CONCAT('Value ', i)); SET i = i + 1; END WHILE; INSERT INTO t_limiter (id, grouper, value) VALUES (i, _modulus, CONCAT('Value ', i)); END; $$ DELIMITER ; START TRANSACTION; CALL prc_fill_limiter (1000000, 32); COMMIT; [/sourcecode] </div> First, we need to select everything from the table, ordered first by <code>grouper</code> then by <code>id</code> (like analytics functions do). Session variables expressions are then being evaluated in the <code>WHERE</code> clause of the query. The internal counter <code>@r</code>, being initially set to the number or rows we want to select for each <code>grouper</code>, is decremented on each row, and the when it falls under zero, the rows start being filtered out. Whenever the <code>grouper</code> value changes (thus indicating that the new <code>PARTITION</code> started), the value of <code>@r</code> is reset back to limit. SELECT l.* FROM ( SELECT @lim := 15, @cg := -1 ) vars, t_limiter l WHERE CASE WHEN @cg <> grouper THEN @r := @lim ELSE 1 END > 0 AND (@r := @r - 1) >= 0 AND (@cg := grouper) IS NOT NULL ORDER BY grouper, id
Being run, the query returns the following:
id | grouper | value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | 0 | Value 32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64 | 0 | Value 64 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
447 | 31 | Value 447 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
479 | 31 | Value 479 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1000000 | 32 | Value 1000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
481 rows fetched in 0,0044s (3,4241s) |
As you can see, the query above works and completes in Θ(n)
time. When you have lots of groups and just a few items in each group, that's exactly what you need.
To be continued.
Thank you for explanation! I tried a slightly different variant of it and it seems that the row sampling with variables is somehow dependent on the physical order of records. I tried:
SELECT l.*
FROM (
SELECT @lim := 1,
@cg := -1
) vars, (SELECT * FROM group_limit ORDER BY rank DESC) l
WHERE CASE WHEN @cg grouper THEN @r := @lim ELSE 1 END > 0
AND (@r := @r – 1) >= 0
AND (@cg := grouper) IS NOT NULL
ORDER BY grouper
without success (I got too many results). Any hints?
prinz
9 Nov 12 at 13:17
@prinz: session variables are not generally to be used like this. They are undocumented and their behaviour changes between even minor releases. I only keep this post for historical reasons.
Quassnoi
9 Nov 12 at 14:05