Row sampling
Sometimes we need to get a sample row from a table satisfying a certain condition. Like, get a first row for each month.
MS SQL and Oracle supply analytical function ROW_NUMBER() for this purpose.
Let’s create a simple table to illustrate our needs and see how do we query it.
CREATE TABLE t_limiter
(
id int(10) unsigned NOT NULL,
grouper int(10) unsigned NOT NULL,
value varchar(45) NOT NULL,
PRIMARY KEY (id),
KEY ix_limiter_grouper_id (grouper,id)
)
ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT *
FROM (
SELECT l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
FROM t_limiter l
) lo
WHERE rn = 1
MySQL of course lacks this function, but it can be easily emulated using:
SELECT *
FROM (
SELECT grouper, MIN(id) AS mid
FROM t_limiter
GROUP BY
grouper
) dl, t_limiter l
WHERE l.grouper = dl.grouper
AND l.id = mid
That’s fine, we get exactly one row for each grouper. But what is we need 15 rows for each month?
With analytical functions, it’s still simple:
SELECT *
FROM (
SELECT l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
FROM t_limiter l
) lo
WHERE rn <= 15
In MySQL we can do, umm…
Grouping won’t help us in this case, as it always returns 1 row for each grouper, and we need 15 rows for a join. Same with correlated subqueries: they return 1 row.
In this case, we can emulate analytical functions using tricks with session variables.
Let’s populate the table and see how the query works.
We will create 999,999 records with grouper = id % 32, which will give us grouper‘s from 0 to 31, and 1 more record with grouper = 32. This final record will have id of 1,000,000.
First, we need to select everything from the table, ordered first by grouper then by id (like analytics functions do). Session variables expressions are then being evaluated in the WHERE clause of the query.
The internal counter @r, being initially set to the number or rows we want to select for each grouper, is decremented on each row, and the when it falls under zero, the rows start being filtered out.
Whenever the grouper value changes (thus indicating that the new PARTITION started), the value of @r is reset back to limit.
SELECT l.*
FROM (
SELECT @lim := 15,
@cg := -1
) vars,
t_limiter l
WHERE CASE WHEN @cg <> grouper THEN @r := @lim ELSE 1 END > 0
AND (@r := @r - 1) >= 0
AND (@cg := grouper) IS NOT NULL
ORDER BY
grouper, id
Being run, the query returns the following:
| id | grouper | value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0 | Value 32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 64 | 0 | Value 64 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 447 | 31 | Value 447 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 479 | 31 | Value 479 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1000000 | 32 | Value 1000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 481 rows fetched in 0,0044s (3,4241s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As you can see, the query above works and completes in Θ(n) time. When you have lots of groups and just a few items in each group, that’s exactly what you need.
To be continued.
Subscribe in a reader
Thank you for explanation! I tried a slightly different variant of it and it seems that the row sampling with variables is somehow dependent on the physical order of records. I tried:
SELECT l.*
FROM (
SELECT @lim := 1,
@cg := -1
) vars, (SELECT * FROM group_limit ORDER BY rank DESC) l
WHERE CASE WHEN @cg grouper THEN @r := @lim ELSE 1 END > 0
AND (@r := @r – 1) >= 0
AND (@cg := grouper) IS NOT NULL
ORDER BY grouper
without success (I got too many results). Any hints?
prinz
9 Nov 12 at 13:17
@prinz: session variables are not generally to be used like this. They are undocumented and their behaviour changes between even minor releases. I only keep this post for historical reasons.
Quassnoi
9 Nov 12 at 14:05