EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

MySQL: difference between sets

From Stack Overflow:

I have a table that holds data about items that existed at a certain time (regular snapshots taken).

Simple example:

Timestamp ID
1 A
1 B
2 A
2 B
2 C
3 A
3 D
4 D
4 E

In this case, item C gets created sometime between snapshot 1 and 2, sometime between snapshot 2 and 3 B and C disappear and D gets created, etc.

The table is reasonably large (millions of records) and for each timestamp there are about 50 records.

What’s the most efficient way of selecting the item ids for items that disappear between two consecutive timestamps?

So for the above example I would like to get the following:

Previous snapshot Current snapshot Removed
1 2 NULL
2 3 B, C
3 4 A

If it doesn’t make the query inefficient, can it be extended to automatically use the latest (i. e. MAX) timestamp and the previous one?

We basically need to do the following things here:

  1. Split the table into sets grouped by timestamp
  2. Compare each set with the one of previous timestamp
  3. Find the values missing in the current set and concatenate them

This is possible to do using only the standard ANSI SQL operators, however, this will be inefficient in MySQL.

Let’s create a sample table and see how to work around this:


Table creation details

This table contains 130,000 records: 10,000 sets having from 26 to 1 records in each.

The timestamps are deliberately made not consecutive.

To find out the list of timestamps, we issue the following query:

SELECT  DISTINCT ts
FROM    t_set

See the results and the plan

, which uses INDEX FOR GROUP-BY and completes in but 126 milliseconds.

However, it we try to select the previous value of timestamp along with each timestamp using a subquery with MAX:

SELECT  (
        SELECT  MAX(ts)
        FROM    t_set si
        WHERE   si.ts < so.ts
        ) AS pts,
        ts
FROM    (
        SELECT  DISTINCT ts
        FROM    t_set
        LIMIT 100
        ) so

, we will get the following results:

See the results and the plan

We see USING WHERE in the query plan for a subquery. This is because MySQL cannot use RANGE access method in subqueries.

This leads to rescanning the whole table for each distinct value of ts, which takes 700 seconds.

To work around this, we just need to use a session variable to store the previous value of the ts:

SELECT  @r AS pts,
        @r := ts AS ts
FROM    (
        SELECT  ts
        FROM    (
                SELECT  @r := NULL
                ) vars,
                (
                SELECT  DISTINCT ts
                FROM    t_set
                ) soo
        ) so

See the results and the plan

This is very fast, just 140 milliseconds.

Fortunately, MySQL does use REF access (as opposed to RANGE access) in subqueries, that’s why we can use the following query to obtain the result we’re after:

SELECT  ts,
        (
        SELECT  GROUP_CONCAT(code)
        FROM    t_set si
        WHERE   si.ts =
                (
                SELECT  sp.ts
                FROM    t_set sp
                WHERE   sp.ts = so.pts
                ORDER BY
                        sp.ts DESC
                LIMIT 1
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    t_set sn
                WHERE   sn.ts = so.ts
                        AND sn.code = si.code
                )
        ) AS diff
FROM    (
        SELECT  @r AS pts,
                @r := ts AS ts
        FROM    (
                SELECT  @r := NULL
                ) vars,
                (
                SELECT  DISTINCT ts
                FROM    t_set
                ) soo
        ) so
HAVING  diff IS NOT NULL

See the results and the plan

, which completed in only 3.75 seconds.

Written by Quassnoi

August 19th, 2009 at 11:00 pm

Posted in MySQL

Comments are closed.