EXPLAIN EXTENDED

How to create fast database queries

Row sampling

with 2 comments

Sometimes we need to get a sample row from a table satisfying a certain condition. Like, get a first row for each month.

MS SQL and Oracle supply analytical function ROW_NUMBER() for this purpose.

Let's create a simple table to illustrate our needs and see how do we query it.

CREATE TABLE t_limiter
   (
   id int(10) unsigned NOT NULL,
       grouper int(10) unsigned NOT NULL,
  value varchar(45) NOT NULL,
 PRIMARY KEY (id),
   KEY ix_limiter_grouper_id (grouper,id)
      )
ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT *
FROM   (
       SELECT  l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
       FROM    t_limiter l
       ) lo
WHERE  rn = 1

MySQL of course lacks this function, but it can be easily emulated using:

SELECT  *
FROM    (
        SELECT  grouper, MIN(id) AS mid
        FROM    t_limiter
        GROUP BY
                grouper
        ) dl, t_limiter l
WHERE   l.grouper = dl.grouper
        AND l.id = mid

That's fine, we get exactly one row for each grouper. But what is we need 15 rows for each month?

With analytical functions, it's still simple:

SELECT *
FROM   (
       SELECT  l.*, ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id DESC) AS rn
       FROM    t_limiter l
       ) lo
WHERE  rn <= 15
&#91;/sourcecode&#93;

In <code>MySQL</code> we can do, umm…

Grouping won't help us in this case, as it always returns <strong>1</strong> row for each <code>grouper</code>, and we need <strong>15</strong> rows for a join. Same with correlated subqueries: they return <strong>1</strong> row.

In this case, we can emulate analytical functions using tricks with session variables.

Let's populate the table and see how the query works.

We will create <strong>999,999</strong> records with <code>grouper = id % 32</code>, which will give us <code>grouper</code>'s from <strong>0</strong> to <strong>31</strong>, and <strong>1</strong> more record with <code>grouper = 32</code>. This final record will have <code>id</code> of <strong>1,000,000</strong>.


<a href="#" onclick="xcollapse('X10374');return false;"><strong>Table population details</strong></a>
<br />
<div id="X10374" style="display: none; ">

DELIMITER $$
CREATE PROCEDURE prc_fill_limiter(_cnt INT, _modulus INT)
BEGIN
        DECLARE i INT;
      SET i = 1;
  WHILE i < _cnt DO
           INSERT
              INTO    t_limiter (id, grouper, value)
              VALUES  (i, i % _modulus, CONCAT('Value ', i));
           SET i = i + 1;
      END WHILE;
  INSERT
      INTO    t_limiter (id, grouper, value)
      VALUES  (i, _modulus, CONCAT('Value ', i));
END;
$$
DELIMITER ;
START TRANSACTION;
CALL prc_fill_limiter (1000000, 32);
COMMIT;
&#91;/sourcecode&#93;
</div>

First, we need to select everything from the table, ordered first by <code>grouper</code> then by <code>id</code> (like analytics functions do). Session variables expressions are then being evaluated in the <code>WHERE</code> clause of the query.

The internal counter <code>@r</code>, being initially set to the number or rows we want to select for each <code>grouper</code>, is decremented on each row, and the when it falls under zero, the rows start being filtered out.

Whenever the <code>grouper</code> value changes (thus indicating that the new <code>PARTITION</code> started), the value of <code>@r</code> is reset back to limit.


SELECT  l.*
FROM    (
        SELECT  @lim := 15,
                @cg := -1
        ) vars,
        t_limiter l
WHERE   CASE WHEN @cg <> grouper THEN @r := @lim ELSE 1 END > 0
        AND (@r := @r - 1) >= 0
        AND (@cg := grouper) IS NOT NULL
ORDER BY
        grouper, id

Being run, the query returns the following:

id grouper value
32 0 Value 32
64 0 Value 64
447 31 Value 447
479 31 Value 479
1000000 32 Value 1000000
481 rows fetched in 0,0044s (3,4241s)

As you can see, the query above works and completes in Θ(n) time. When you have lots of groups and just a few items in each group, that's exactly what you need.

To be continued.

Written by Quassnoi

March 5th, 2009 at 9:00 pm

Posted in MySQL

2 Responses to 'Row sampling'

Subscribe to comments with RSS

  1. Thank you for explanation! I tried a slightly different variant of it and it seems that the row sampling with variables is somehow dependent on the physical order of records. I tried:

    SELECT l.*
    FROM (
    SELECT @lim := 1,
    @cg := -1
    ) vars, (SELECT * FROM group_limit ORDER BY rank DESC) l
    WHERE CASE WHEN @cg grouper THEN @r := @lim ELSE 1 END > 0
    AND (@r := @r – 1) >= 0
    AND (@cg := grouper) IS NOT NULL
    ORDER BY grouper

    without success (I got too many results). Any hints?

    prinz

    9 Nov 12 at 13:17

  2. @prinz: session variables are not generally to be used like this. They are undocumented and their behaviour changes between even minor releases. I only keep this post for historical reasons.

    Quassnoi

    9 Nov 12 at 14:05

Leave a Reply