MySQL: difference between sets at EXPLAIN EXTENDED

I have a table that holds data about items that existed at a certain time (regular snapshots taken).

Simple example:

Timestamp ID

1 A

1 B

2 A

2 B

2 C

3 A

3 D

4 D

4 E

In this case, item C gets created sometime between snapshot 1 and 2, sometime between snapshot 2 and 3 B and C disappear and D gets created, etc.

The table is reasonably large (millions of records) and for each timestamp there are about 50 records.

What's the most efficient way of selecting the item ids for items that disappear between two consecutive timestamps?

So for the above example I would like to get the following:

Previous snapshot Current snapshot Removed

1 2 NULL

2 3 B, C

3 4 A

If it doesn't make the query inefficient, can it be extended to automatically use the latest (i. e. MAX) timestamp and the previous one?

Timestamp	ID
1	A
1	B
2	A
2	B
2	C
3	A
3	D
4	D
4	E

Previous snapshot	Current snapshot	Removed
1	2	NULL
2	3	B, C
3	4	A

We basically need to do the following things here:

Split the table into sets grouped by timestamp
Compare each set with the one of previous timestamp
Find the values missing in the current set and concatenate them

This is possible to do using only the standard ANSI SQL operators, however, this will be inefficient in MySQL.

Let's create a sample table and see how to work around this:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_set (
        ts INT NOT NULL,
        code CHAR(1) NOT NULL,
        PRIMARY KEY pk_set_ts_code (ts, code)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(10000);
COMMIT;

INSERT
INTO    t_set
SELECT  ft.id * 29 + RAND(20090819) * 10, CHAR(64 + fs.id)
FROM    filler ft
JOIN    (
        SELECT  id
        FROM    filler
        ORDER BY
                id
        LIMIT 25
        ) fs
ON      fs.id <= 25 - ((ft.id - 1) div 400);
&#91;/sourcecode&#93;
</div>

This table contains <strong>130,000</strong> records: <strong>10,000</strong> sets having from <strong>26</strong> to <strong>1</strong> records in each.

The timestamps are deliberately made not consecutive.

To find out the list of timestamps, we issue the following query:


SELECT  DISTINCT ts
FROM    t_set

See the results and the plan

ts
36
64
88
120
153

289950
289974
290008
10000 rows fetched in 0.1350s (0.1265s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_set	range		PRIMARY	4		133168	100.00	Using index for group-by

, which uses INDEX FOR GROUP-BY and completes in but 126 milliseconds.

However, it we try to select the previous value of timestamp along with each timestamp using a subquery with MAX:

SELECT  (
        SELECT  MAX(ts)
        FROM    t_set si
        WHERE   si.ts < so.ts
        ) AS pts,
        ts
FROM    (
        SELECT  DISTINCT ts
        FROM    t_set
        LIMIT 100
        ) so
&#91;/sourcecode&#93;

, we will get the following results:

<a href="#" onclick="xcollapse('X7849');return false;"><strong>See the results and the plan</strong></a>
<br />
<div id="X7849" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>pts</th><th>ts</th></tr>
<tr><td class="integer"></td><td class="integer">36</td></tr>
<tr><td class="integer">36</td><td class="integer">64</td></tr>
<tr><td class="integer">64</td><td class="integer">88</td></tr>
<tr class="break"><td colspan="100"></td></tr>
<tr><td class="integer">289916</td><td class="integer">289950</td></tr>
<tr><td class="integer">289950</td><td class="integer">289974</td></tr>
<tr><td class="integer">289974</td><td class="integer">290008</td></tr>
<tr class="statusbar"><td colspan="100">10000 rows fetched in 0.0021s (796.8578s)</td></tr>
</table>
</div>

<div class="terminal">
<table class="terminal">
<tr>
<th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>filtered</th><th>Extra</th></tr>
<tr><td class="bigint">1</td><td class="varchar">PRIMARY</td><td class="varchar">&lt;derived3&gt;</td><td class="varchar">ALL</td><td class="varchar"></td><td class="varchar"></td><td class="varchar"></td><td class="varchar"></td><td class="bigint">100</td><td class="double">100.00</td><td class="varchar"></td></tr>
<tr><td class="bigint">3</td><td class="varchar">DERIVED</td><td class="varchar">t_set</td><td class="varchar">index</td><td class="varchar"></td><td class="varchar">PRIMARY</td><td class="varchar">7</td><td class="varchar"></td><td class="bigint">133167</td><td class="double">16.67</td><td class="varchar">Using index; Using temporary</td></tr>
<tr><td class="bigint">2</td><td class="varchar">DEPENDENT SUBQUERY</td><td class="varchar">si</td><td class="varchar">index</td><td class="varchar">PRIMARY</td><td class="varchar">PRIMARY</td><td class="varchar">7</td><td class="varchar"></td><td class="bigint">133167</td><td class="double">100.00</td><td class="varchar">Using where; Using index</td></tr>
</table>
</div>
</div>

We see <code>USING WHERE</code> in the query plan for a subquery. This is because <strong>MySQL</strong> cannot use <strong>RANGE</strong> access method in subqueries.

This leads to rescanning the <em>whole</em> table for each distinct value of <code>ts</code>, which takes <strong>700 seconds</strong>.

To work around this, we just need to use a session variable to store the previous value of the <code>ts</code>:


SELECT  @r AS pts,
        @r := ts AS ts
FROM    (
        SELECT  ts
        FROM    (
                SELECT  @r := NULL
                ) vars,
                (
                SELECT  DISTINCT ts
                FROM    t_set
                ) soo
        ) so

See the results and the plan

pts	ts
	36
36	64
64	88

289916	289950
289950	289974
289974	290008
10000 rows fetched in 0.1958s (0.1464s)

id	select_type	table	type	key	key_len	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL			10000	100.00
2	DERIVED	<derived3>	system			1	100.00
2	DERIVED	<derived4>	ALL			10000	100.00
4	DERIVED	t_set	range	PRIMARY	4	22195	100.00	Using index for group-by
3	DERIVED							No tables used

This is very fast, just 140 milliseconds.

Fortunately, MySQL does use REF access (as opposed to RANGE access) in subqueries, that's why we can use the following query to obtain the result we're after:

SELECT  ts,
        (
        SELECT  GROUP_CONCAT(code)
        FROM    t_set si
        WHERE   si.ts =
                (
                SELECT  sp.ts
                FROM    t_set sp
                WHERE   sp.ts = so.pts
                ORDER BY
                        sp.ts DESC
                LIMIT 1
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    t_set sn
                WHERE   sn.ts = so.ts
                        AND sn.code = si.code
                )
        ) AS diff
FROM    (
        SELECT  @r AS pts,
                @r := ts AS ts
        FROM    (
                SELECT  @r := NULL
                ) vars,
                (
                SELECT  DISTINCT ts
                FROM    t_set
                ) soo
        ) so
HAVING  diff IS NOT NULL

See the results and the plan

ts	diff
11632	Y
23238	X
34835	W
46437	V
58033	U
69633	T
81231	S
92830	R
104434	Q
116039	P
127634	O
139232	N
150830	M
162435	L
174032	K
185634	J
197229	I
208830	H
220434	G
232038	F
243633	E
255237	D
266833	C
278430	B
24 rows fetched in 0.0006s (3.7499s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived5>	ALL					10000	100.00
5	DERIVED	<derived6>	system					1	100.00
5	DERIVED	<derived7>	ALL					10000	100.00
7	DERIVED	t_set	range		PRIMARY	4		22195	100.00	Using index for group-by
6	DERIVED									No tables used
2	DEPENDENT SUBQUERY	si	ref	PRIMARY	PRIMARY	4	func	6	100.00	Using where; Using index
4	DEPENDENT SUBQUERY	sn	eq_ref	PRIMARY	PRIMARY	7	so.ts,20090819_sets.si.code	1	100.00	Using where; Using index
3	DEPENDENT SUBQUERY	sp	ref	PRIMARY	PRIMARY	4	so.pts	6	100.00	Using where; Using index

, which completed in only 3.75 seconds.

Written by Quassnoi

August 19th, 2009 at 11:00 pm

Posted in MySQL

One Response to 'MySQL: difference between sets'

Subscribe to comments with RSS

Thank you, very usefull article.
I’m sorry but I don’t see any sense for the subquery (recs. 06-13)
( SELECT sp.ts
FROM t_set sp
WHERE sp.ts = so.pts
ORDER BY
sp.ts DESC
LIMIT 1 )
I replaced it with the construction “si.ts = so.pts” with the same output result. But may be I just not got the idea ?

Max

5 May 17 at 17:03

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

EXPLAIN EXTENDED