EXPLAIN EXTENDED

How to create fast database queries

Row sampling

with 2 comments

Sometimes we need to get a sample row from a table satisfying a certain condition. Like, get a first row for each month.

MS SQL and Oracle supply analytical function ROW_NUMBER() for this purpose.

Let's create a simple table to illustrate our needs and see how do we query it.

CREATE TABLE t_limiter
   (
   id int(10) unsigned NOT NULL,
       grouper int(10) unsigned NOT NULL,
  value varchar(45) NOT NULL,
 PRIMARY KEY (id),
   KEY ix_limiter_grouper_id (grouper,id)
      )
ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT *
FROM   (
       SELECT  l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
       FROM    t_limiter l
       ) lo
WHERE  rn = 1

MySQL of course lacks this function, but it can be easily emulated using:

SELECT  *
FROM    (
        SELECT  grouper, MIN(id) AS mid
        FROM    t_limiter
        GROUP BY
                grouper
        ) dl, t_limiter l
WHERE   l.grouper = dl.grouper
        AND l.id = mid

That's fine, we get exactly one row for each grouper. But what is we need 15 rows for each month?

With analytical functions, it's still simple:

SELECT *
FROM   (
       SELECT  l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
       FROM    t_limiter l
       ) lo
WHERE  rn <= 15

In MySQL we can do, umm…

Grouping won't help us in this case, as it always returns 1 row for each grouper, and we need 15 rows for a join. Same with correlated subqueries: they return 1 row.

In this case, we can emulate analytical functions using tricks with session variables.

Let's populate the table and see how the query works.

We will create 999,999 records with grouper = id % 32, which will give us grouper's from 0 to 31, and 1 more record with grouper = 32. This final record will have id of 1,000,000.

Table population details

First, we need to select everything from the table, ordered first by grouper then by id (like analytics functions do). Session variables expressions are then being evaluated in the WHERE clause of the query.

The internal counter @r, being initially set to the number or rows we want to select for each grouper, is decremented on each row, and the when it falls under zero, the rows start being filtered out.

Whenever the grouper value changes (thus indicating that the new PARTITION started), the value of @r is reset back to limit.

SELECT  l.*
FROM    (
        SELECT  @lim := 15,
                @cg := -1
        ) vars,
        t_limiter l
WHERE   CASE WHEN @cg <> grouper THEN @r := @lim ELSE 1 END > 0
        AND (@r := @r - 1) >= 0
        AND (@cg := grouper) IS NOT NULL
ORDER BY
        grouper, id

Being run, the query returns the following:

id grouper value
32 0 Value 32
64 0 Value 64
447 31 Value 447
479 31 Value 479
1000000 32 Value 1000000
481 rows fetched in 0,0044s (3,4241s)

As you can see, the query above works and completes in Θ(n) time. When you have lots of groups and just a few items in each group, that's exactly what you need.

To be continued.

Written by Quassnoi

March 5th, 2009 at 9:00 pm

Posted in MySQL

2 Responses to 'Row sampling'

Subscribe to comments with RSS or TrackBack to 'Row sampling'.

  1. Thank you for explanation! I tried a slightly different variant of it and it seems that the row sampling with variables is somehow dependent on the physical order of records. I tried:

    SELECT l.*
    FROM (
    SELECT @lim := 1,
    @cg := -1
    ) vars, (SELECT * FROM group_limit ORDER BY rank DESC) l
    WHERE CASE WHEN @cg grouper THEN @r := @lim ELSE 1 END > 0
    AND (@r := @r – 1) >= 0
    AND (@cg := grouper) IS NOT NULL
    ORDER BY grouper

    without success (I got too many results). Any hints?

    prinz

    9 Nov 12 at 13:17

  2. @prinz: session variables are not generally to be used like this. They are undocumented and their behaviour changes between even minor releases. I only keep this post for historical reasons.

    Quassnoi

    9 Nov 12 at 14:05

Leave a Reply