20 latest unique records
From Stack Overflow:
I have a logfile which logs the insert/delete/updates from all kinds of tables.
I would like to get an overview of, for example, the last 20 people whose records were updated, ordered by the last update (datetime DESC).
A common solution for such a task would be writing an aggregate query with ORDER BY and LIMIT:
SELECT person, MAX(ts) AS last_update FROM logfile GROUP BY person ORDER BY last_update DESC LIMIT 20
What's bad about this solution? Performance, as usual.
Since last_update is an aggregate, it cannot be indexed, and ORDER BY on unindexed fields brings in our good old friend, filesort.
Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on (person, ts), MySQL will tend to use a loose index scan on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), a loose index scan can even degrade performance, and MySQL will generally avoid it.
We should use another approach here. Let's create a sample table and test this approach:
Table creation details
This table has 1,000,000 records.
Instead of a single field, person, I created two different fields: sparse and dense. The first one has 30 distinct values, while the second one has 30,000. This will help us see how data distribution affects the performance of different queries.
Let's run our original queries. We'll adjust them a little to help MySQL pick the correct plans:
SELECT * FROM ( SELECT sparse, MAX(ts) AS last_update FROM logfile GROUP BY sparse ) q ORDER BY last_update DESC LIMIT 20;
SELECT * FROM ( SELECT dense, MAX(ts) AS last_update FROM logfile GROUP BY dense ) q ORDER BY last_update DESC LIMIT 20;
We see that both queries use the same plan and return 20 records, but the first one is instant, while the second one runs for 500 ms. Both queries use filesort, but in the second case it has to sort 30,000 records (compared to 30 in the first).
In this case, it is better to use another approach.
With our original query, we take each person and see which record is the latest for this person. But we can just as well do it the other way round: take the records in descending order, one by one, and for each record see if it's the latest for its person. If it is, we return it; if it's not, the record for this person has already been returned (remember, we take them in descending order).
It's easy to see that 20 records returned this way will, first, belong to 20 different people, and, second, be the latest records of their respective persons. This is exactly what we need.
The records can easily be scanned in descending order using the index on (ts, id). But how do we check if a record is the latest? It's simple: we just take the last record for the given person from the index on (person, ts, id) and compare its id. This takes but a single index seek per record and is almost instant.
Here's the query to do it:
SELECT id, sparse, dense, ts FROM logfile lf WHERE id = ( SELECT id FROM logfile lfi WHERE lfi.sparse = lf.sparse ORDER BY sparse DESC, ts DESC, id DESC LIMIT 1 ) ORDER BY ts DESC, id DESC LIMIT 20
| id | sparse | dense | ts |
|---|---|---|---|
| 121946 | 15 | 25324 | 2010-08-23 23:59:58 |
| 276499 | 11 | 3268 | 2010-08-23 23:59:56 |
| 62419 | 26 | 13060 | 2010-08-23 23:59:56 |
| 254750 | 30 | 2327 | 2010-08-23 23:59:42 |
| 96079 | 29 | 23968 | 2010-08-23 23:59:32 |
| 290657 | 13 | 1622 | 2010-08-23 23:58:54 |
| 278842 | 27 | 29693 | 2010-08-23 23:58:53 |
| 329318 | 7 | 655 | 2010-08-23 23:58:46 |
| 384956 | 5 | 11843 | 2010-08-23 23:58:00 |
| 386333 | 12 | 18894 | 2010-08-23 23:57:44 |
| 260404 | 14 | 9398 | 2010-08-23 23:57:24 |
| 471000 | 6 | 18012 | 2010-08-23 23:56:58 |
| 172079 | 2 | 2379 | 2010-08-23 23:56:48 |
| 112653 | 24 | 4186 | 2010-08-23 23:56:13 |
| 291683 | 17 | 20198 | 2010-08-23 23:56:12 |
| 144673 | 23 | 25055 | 2010-08-23 23:55:08 |
| 172118 | 19 | 29039 | 2010-08-23 23:55:07 |
| 198913 | 20 | 9887 | 2010-08-23 23:53:44 |
| 491436 | 10 | 17752 | 2010-08-23 23:51:52 |
| 346651 | 4 | 10951 | 2010-08-23 23:50:53 |

20 rows fetched in 0.0007s (0.0034s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | lf | index | | ix_logfile_ts_id | 12 | | 20 | 2500660.00 | Using where |
| 2 | DEPENDENT SUBQUERY | lfi | ref | ix_logfile_sparse_ts_id | ix_logfile_sparse_ts_id | 4 | 20100824_latest.lf.sparse | 27785 | 100.00 | Using where; Using index |

Field or reference '20100824_latest.lf.sparse' of SELECT #2 was resolved in SELECT #1

select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`sparse` = `20100824_latest`.`lf`.`sparse`) order by `20100824_latest`.`lfi`.`sparse` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
As we can see, this query uses two different indexes. The first one, on (ts, id), is used to scan all records along the overall timeline; the second one, on (sparse, ts, id), is used to find the id of the latest entry for a person and check whether it's the same as the record selected from the general timeline.
The query is instant: 3 ms.
Let's check the same query on a column with lots of values:
SELECT id, sparse, dense, ts FROM logfile lf WHERE id = ( SELECT id FROM logfile lfi WHERE lfi.dense = lf.dense ORDER BY dense DESC, ts DESC, id DESC LIMIT 1 ) ORDER BY ts DESC, id DESC LIMIT 20
| id | sparse | dense | ts |
|---|---|---|---|
| 121946 | 15 | 25324 | 2010-08-23 23:59:58 |
| 276499 | 11 | 3268 | 2010-08-23 23:59:56 |
| 62419 | 26 | 13060 | 2010-08-23 23:59:56 |
| 254750 | 30 | 2327 | 2010-08-23 23:59:42 |
| 96079 | 29 | 23968 | 2010-08-23 23:59:32 |
| 290657 | 13 | 1622 | 2010-08-23 23:58:54 |
| 278842 | 27 | 29693 | 2010-08-23 23:58:53 |
| 329318 | 7 | 655 | 2010-08-23 23:58:46 |
| 277612 | 15 | 5802 | 2010-08-23 23:58:07 |
| 384956 | 5 | 11843 | 2010-08-23 23:58:00 |
| 386333 | 12 | 18894 | 2010-08-23 23:57:44 |
| 201899 | 7 | 6180 | 2010-08-23 23:57:26 |
| 260404 | 14 | 9398 | 2010-08-23 23:57:24 |
| 471000 | 6 | 18012 | 2010-08-23 23:56:58 |
| 451808 | 26 | 25758 | 2010-08-23 23:56:49 |
| 172079 | 2 | 2379 | 2010-08-23 23:56:48 |
| 367042 | 11 | 821 | 2010-08-23 23:56:39 |
| 112653 | 24 | 4186 | 2010-08-23 23:56:13 |
| 291683 | 17 | 20198 | 2010-08-23 23:56:12 |
| 127839 | 11 | 18615 | 2010-08-23 23:56:01 |

20 rows fetched in 0.0007s (0.0031s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | lf | index | | ix_logfile_ts_id | 12 | | 20 | 2500660.00 | Using where |
| 2 | DEPENDENT SUBQUERY | lfi | ref | ix_logfile_dense_ts_id | ix_logfile_dense_ts_id | 4 | 20100824_latest.lf.dense | 8 | 100.00 | Using where; Using index |

Field or reference '20100824_latest.lf.dense' of SELECT #2 was resolved in SELECT #1

select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`dense` = `20100824_latest`.`lf`.`dense`) order by `20100824_latest`.`lfi`.`dense` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
We see that the query is instant again, despite the data distribution being completely different. This is because the query only skips the records which are not the latest for their persons, and the total number of records to scan is defined by how many records we browse before we encounter the 20th distinct value in our scan. This number drops quickly as the number of distinct persons in the table grows, and with 99% probability it won't exceed 100 records even with only 20 distinct persons in the table.
The only problem that can arise here is when the number of distinct persons in the table is less than the LIMIT we set. In this case, no new records can be returned once all persons have been seen, and a full index scan (accompanied by an index seek per record) will ultimately be performed.
To work around this, the following simple query can be run in advance:
SELECT COUNT(*) FROM ( SELECT DISTINCT sparse FROM logfile LIMIT 20 ) q
| COUNT(*) |
|---|
| 20 |

1 row fetched in 0.0001s (0.0015s)
This query will return the actual number of distinct persons in the table if there are fewer than 20 (or 20 if there are more).
This query is instant even for the dense data:
SELECT COUNT(*) FROM ( SELECT DISTINCT dense FROM logfile LIMIT 20 ) q
| COUNT(*) |
|---|
| 20 |

1 row fetched in 0.0001s (0.0024s)
This needs to be run as a separate query because MySQL does not allow anything other than constants in the LIMIT clause. The result of this query should be substituted into the LIMIT clause on the client side or in a dynamically composed query on the server.
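The server-side variant could be sketched with prepared statements. This is an assumption on my part rather than the article's own code: MySQL (5.0.7 and later) does allow a placeholder in the LIMIT clause of a server-side prepared statement, so the count can be fed in through a user variable:

```sql
-- Sketch: count the distinct persons first (capped at 20)...
SELECT  COUNT(*)
INTO    @lim
FROM    (
        SELECT  DISTINCT sparse
        FROM    logfile
        LIMIT 20
        ) q;

-- ...then use that count as the LIMIT via a prepared statement,
-- where a ? placeholder in LIMIT is legal.
PREPARE latest FROM
        'SELECT  id, sparse, dense, ts
         FROM    logfile lf
         WHERE   id =
                 (
                 SELECT  id
                 FROM    logfile lfi
                 WHERE   lfi.sparse = lf.sparse
                 ORDER BY sparse DESC, ts DESC, id DESC
                 LIMIT 1
                 )
         ORDER BY ts DESC, id DESC
         LIMIT ?';

EXECUTE latest USING @lim;
DEALLOCATE PREPARE latest;
```

On the client side, the equivalent is simply running the COUNT query first and interpolating its result into the main query's LIMIT.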
Summary
To select a number of latest unique records from a table, one can use aggregate functions; however, this can degrade query performance.
This can be done more efficiently by creating two different indexes on the table and checking the records taken from the general timeline against the end of the index on the person's timeline.
To avoid performance degradation in marginal cases (when the total number of persons in the table is less than the LIMIT), it is possible to make an additional check for the number of distinct persons and adjust the LIMIT clause if there are not enough of them.
P. S. I decided to enable comments for the technical posts as well. You are welcome to comment.
nice!
I don’t like the idea of creating indexes to make a single query run fast. Also, I do not know about others but I personally don’t get enough time to look at each query in such detail.
arun
25 Aug 10 at 23:06
@arun: in your opinion, how many queries does it take to make an index worth creating?
Quassnoi
26 Aug 10 at 00:50
This is excellent stuff indeed!
No hard feeling Arun, but I am just curious to know if you “don’t get enough time to look at each query in such detail”, then what do you think you should be spending your time on? you better play golf rather than mingling with databases!
vivek
1 Sep 10 at 11:54