MySQL: difference between sets
From Stack Overflow:
I have a table that holds data about items that existed at a certain time (regular snapshots taken).
Simple example:
Timestamp ID 1 A 1 B 2 A 2 B 2 C 3 A 3 D 4 D 4 E In this case, item
Cgets created sometime between snapshot 1 and 2, sometime between snapshot 2 and 3BandCdisappear andDgets created, etc.The table is reasonably large (millions of records) and for each timestamp there are about 50 records.
What’s the most efficient way of selecting the item
ids for items that disappear between two consecutive timestamps?So for the above example I would like to get the following:
Previous snapshot Current snapshot Removed 1 2 NULL 2 3 B, C 3 4 A If it doesn’t make the query inefficient, can it be extended to automatically use the latest (i. e.
MAX) timestamp and the previous one?
We basically need to do the following things here:
- Split the table into sets grouped by timestamp
- Compare each set with the one of previous timestamp
- Find the values missing in the current set and concatenate them
This is possible to do using only the standard ANSI SQL operators, however, this will be inefficient in MySQL.
Let’s create a sample table and see how to work around this:
This table contains 130,000 records: 10,000 sets having from 26 to 1 records in each.
The timestamps are deliberately made not consecutive.
To find out the list of timestamps, we issue the following query:
SELECT DISTINCT ts FROM t_set
, which uses INDEX FOR GROUP-BY and completes in but 126 milliseconds.
However, it we try to select the previous value of timestamp along with each timestamp using a subquery with MAX:
SELECT (
SELECT MAX(ts)
FROM t_set si
WHERE si.ts < so.ts
) AS pts,
ts
FROM (
SELECT DISTINCT ts
FROM t_set
LIMIT 100
) so
, we will get the following results:
We see USING WHERE in the query plan for a subquery. This is because MySQL cannot use RANGE access method in subqueries.
This leads to rescanning the whole table for each distinct value of ts, which takes 700 seconds.
To work around this, we just need to use a session variable to store the previous value of the ts:
SELECT @r AS pts,
@r := ts AS ts
FROM (
SELECT ts
FROM (
SELECT @r := NULL
) vars,
(
SELECT DISTINCT ts
FROM t_set
) soo
) so
This is very fast, just 140 milliseconds.
Fortunately, MySQL does use REF access (as opposed to RANGE access) in subqueries, that’s why we can use the following query to obtain the result we’re after:
SELECT ts,
(
SELECT GROUP_CONCAT(code)
FROM t_set si
WHERE si.ts =
(
SELECT sp.ts
FROM t_set sp
WHERE sp.ts = so.pts
ORDER BY
sp.ts DESC
LIMIT 1
)
AND NOT EXISTS
(
SELECT NULL
FROM t_set sn
WHERE sn.ts = so.ts
AND sn.code = si.code
)
) AS diff
FROM (
SELECT @r AS pts,
@r := ts AS ts
FROM (
SELECT @r := NULL
) vars,
(
SELECT DISTINCT ts
FROM t_set
) soo
) so
HAVING diff IS NOT NULL
, which completed in only 3.75 seconds.
Subscribe in a reader