Row sampling
Sometimes we need to get a sample row from a table satisfying a certain condition. Like, get a first row for each month.
MS SQL and Oracle supply analytical function ROW_NUMBER() for this purpose.
Let's create a simple table to illustrate our needs and see how do we query it.
CREATE TABLE t_limiter
(
id int(10) unsigned NOT NULL,
grouper int(10) unsigned NOT NULL,
value varchar(45) NOT NULL,
PRIMARY KEY (id),
KEY ix_limiter_grouper_id (grouper,id)
)
ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT *
FROM (
SELECT l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
FROM t_limiter l
) lo
WHERE rn = 1
MySQL of course lacks this function, but it can be easily emulated using:
SELECT *
FROM (
SELECT grouper, MIN(id) AS mid
FROM t_limiter
GROUP BY
grouper
) dl, t_limiter l
WHERE l.grouper = dl.grouper
AND l.id = mid
That's fine, we get exactly one row for each grouper. But what is we need 15 rows for each month?
With analytical functions, it's still simple:
SELECT *
FROM (
SELECT l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
FROM t_limiter l
) lo
WHERE rn <= 15
[/sourcecode]
In <code>MySQL</code> we can do, umm…
Grouping won't help us in this case, as it always returns <strong>1</strong> row for each <code>grouper</code>, and we need <strong>15</strong> rows for a join. Same with correlated subqueries: they return <strong>1</strong> row.
In this case, we can emulate analytical functions using tricks with session variables.
Let's populate the table and see how the query works.
We will create <strong>999,999</strong> records with <code>grouper = id % 32</code>, which will give us <code>grouper</code>'s from <strong>0</strong> to <strong>31</strong>, and <strong>1</strong> more record with <code>grouper = 32</code>. This final record will have <code>id</code> of <strong>1,000,000</strong>.
<a href="#" onclick="xcollapse('X10374');return false;"><strong>Table population details</strong></a>
<br />
<div id="X10374" style="display: none; ">
DELIMITER $$
CREATE PROCEDURE prc_fill_limiter(_cnt INT, _modulus INT)
BEGIN
DECLARE i INT;
SET i = 1;
WHILE i < _cnt DO
INSERT
INTO t_limiter (id, grouper, value)
VALUES (i, i % _modulus, CONCAT('Value ', i));
SET i = i + 1;
END WHILE;
INSERT
INTO t_limiter (id, grouper, value)
VALUES (i, _modulus, CONCAT('Value ', i));
END;
$$
DELIMITER ;
START TRANSACTION;
CALL prc_fill_limiter (1000000, 32);
COMMIT;
[/sourcecode]
</div>
First, we need to select everything from the table, ordered first by <code>grouper</code> then by <code>id</code> (like analytics functions do). Session variables expressions are then being evaluated in the <code>WHERE</code> clause of the query.
The internal counter <code>@r</code>, being initially set to the number or rows we want to select for each <code>grouper</code>, is decremented on each row, and the when it falls under zero, the rows start being filtered out.
Whenever the <code>grouper</code> value changes (thus indicating that the new <code>PARTITION</code> started), the value of <code>@r</code> is reset back to limit.
SELECT l.*
FROM (
SELECT @lim := 15,
@cg := -1
) vars,
t_limiter l
WHERE CASE WHEN @cg <> grouper THEN @r := @lim ELSE 1 END > 0
AND (@r := @r - 1) >= 0
AND (@cg := grouper) IS NOT NULL
ORDER BY
grouper, id
Being run, the query returns the following:
| id | grouper | value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0 | Value 32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 64 | 0 | Value 64 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 447 | 31 | Value 447 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 479 | 31 | Value 479 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1000000 | 32 | Value 1000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 481 rows fetched in 0,0044s (3,4241s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As you can see, the query above works and completes in Θ(n) time. When you have lots of groups and just a few items in each group, that's exactly what you need.
To be continued.
Subscribe in a reader
Thank you for explanation! I tried a slightly different variant of it and it seems that the row sampling with variables is somehow dependent on the physical order of records. I tried:
SELECT l.*
FROM (
SELECT @lim := 1,
@cg := -1
) vars, (SELECT * FROM group_limit ORDER BY rank DESC) l
WHERE CASE WHEN @cg grouper THEN @r := @lim ELSE 1 END > 0
AND (@r := @r – 1) >= 0
AND (@cg := grouper) IS NOT NULL
ORDER BY grouper
without success (I got too many results). Any hints?
prinz
9 Nov 12 at 13:17
@prinz: session variables are not generally to be used like this. They are undocumented and their behaviour changes between even minor releases. I only keep this post for historical reasons.
Quassnoi
9 Nov 12 at 14:05