EXPLAIN EXTENDED

How to create fast database queries

20 latest unique records


From Stack Overflow:

I have a logfile which logs the insert/delete/updates from all kinds of tables.

I would like to get an overview of, for example, the last 20 people whose records were updated, ordered by the last update (datetime DESC)

A common solution for such a task would be writing an aggregate query with ORDER BY and LIMIT:

SELECT  person, MAX(ts) AS last_update
FROM    logfile
GROUP BY
        person
ORDER BY
        last_update DESC
LIMIT 20

What's wrong with this solution? Performance, as usual.

Since last_update is an aggregate, it cannot be indexed. And ORDER BY on unindexed fields results in our good old friend, filesort.

Note that even in this case indexes can be used and the full table scan avoided: if there is an index on (person, ts), MySQL will tend to use a loose index scan over this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), a loose index scan can even degrade performance, and MySQL will generally avoid it.
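For reference, a minimal sketch of the index that would enable the loose index scan, assuming the hypothetical logfile table from the question (the index name here is illustrative):

CREATE INDEX ix_logfile_person_ts ON logfile (person, ts);

-- With this index, MySQL can jump between the distinct values of person,
-- reading only the last (i. e. maximal) ts entry per group
-- instead of scanning every row.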

We should use another approach here. Let's create a sample table and test both approaches:

Table creation details
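The full creation script is omitted here; a minimal sketch of a compatible table, assuming InnoDB, with the column and index names taken from the query plans shown below:

CREATE TABLE logfile (
        id INT NOT NULL PRIMARY KEY,
        sparse INT NOT NULL, -- 30 distinct values
        dense INT NOT NULL,  -- 30,000 distinct values
        ts DATETIME NOT NULL,
        KEY ix_logfile_ts_id (ts, id),
        KEY ix_logfile_sparse_ts_id (sparse, ts, id),
        KEY ix_logfile_dense_ts_id (dense, ts, id)
) ENGINE=InnoDB;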

This table has 1,000,000 records.

Instead of a single field, person, I created two different fields: sparse and dense. The first one has 30 distinct values, while the second one has 30,000. This will help us see how data distribution affects the performance of different queries.

Let's run our original queries. We'll adjust them a little to help MySQL pick the correct plans:

SELECT  *
FROM    (
        SELECT  sparse, MAX(ts) AS last_update
        FROM    logfile
        GROUP BY
                sparse
        ) q
ORDER BY
        last_update DESC
LIMIT 20;


SELECT  *
FROM    (
        SELECT  dense, MAX(ts) AS last_update
        FROM    logfile
        GROUP BY
                dense
        ) q
ORDER BY
        last_update DESC
LIMIT 20;


We see that both queries use the same plan and return 20 records, but the first one is instant, while the second one runs for 500 ms. Both queries use filesort, but in the second case it has to sort 30,000 records (compared to 30 in the first case).

In this case, it is better to use another approach.

With our original query, we take each person and see which record is the latest for this person. But we can just as well do it the other way round: take the records in descending order, one by one, and for each record see if it is the latest for its person. If it is, we should return it; if it's not, this means the latest record for this person has already been returned (remember, we take them in descending order).

It's easy to see that 20 records returned this way will, first, belong to 20 different people, and, second, be the latest records of their respective persons. This is exactly what we need.

The records can easily be scanned in descending order using the index on (ts, id). But how do we check whether a record is the latest? It's simple: we just take the last record for the given person from the index on (person, ts, id) and compare its id. This takes but a single index seek per record and is almost instant.

Here's the query to do it:

SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id =
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.sparse = lf.sparse
        ORDER BY
                sparse DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT 20

id      sparse  dense  ts
121946  15      25324  2010-08-23 23:59:58
276499  11      3268   2010-08-23 23:59:56
62419   26      13060  2010-08-23 23:59:56
254750  30      2327   2010-08-23 23:59:42
96079   29      23968  2010-08-23 23:59:32
290657  13      1622   2010-08-23 23:58:54
278842  27      29693  2010-08-23 23:58:53
329318  7       655    2010-08-23 23:58:46
384956  5       11843  2010-08-23 23:58:00
386333  12      18894  2010-08-23 23:57:44
260404  14      9398   2010-08-23 23:57:24
471000  6       18012  2010-08-23 23:56:58
172079  2       2379   2010-08-23 23:56:48
112653  24      4186   2010-08-23 23:56:13
291683  17      20198  2010-08-23 23:56:12
144673  23      25055  2010-08-23 23:55:08
172118  19      29039  2010-08-23 23:55:07
198913  20      9887   2010-08-23 23:53:44
491436  10      17752  2010-08-23 23:51:52
346651  4       10951  2010-08-23 23:50:53
20 rows fetched in 0.0007s (0.0034s)
id  select_type         table  type   possible_keys            key                      key_len  ref                        rows   filtered    Extra
1   PRIMARY             lf     index                           ix_logfile_ts_id         12                                  20     2500660.00  Using where
2   DEPENDENT SUBQUERY  lfi    ref    ix_logfile_sparse_ts_id  ix_logfile_sparse_ts_id  4        20100824_latest.lf.sparse  27785  100.00      Using where; Using index
Field or reference '20100824_latest.lf.sparse' of SELECT #2 was resolved in SELECT #1
select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`sparse` = `20100824_latest`.`lf`.`sparse`) order by `20100824_latest`.`lfi`.`sparse` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20

As we can see, this query uses two different indexes. The first one, on (ts, id), is used to scan all records along the overall timeline; the second one, on (sparse, ts, id), is used to find the id of the latest entry for a person and check whether it's the same as the record selected from the general timeline.

The query is instant: 3 ms.

Let's check the same query on a column with lots of values:

SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id =
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.dense = lf.dense
        ORDER BY
                dense DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT 20

id      sparse  dense  ts
121946  15      25324  2010-08-23 23:59:58
276499  11      3268   2010-08-23 23:59:56
62419   26      13060  2010-08-23 23:59:56
254750  30      2327   2010-08-23 23:59:42
96079   29      23968  2010-08-23 23:59:32
290657  13      1622   2010-08-23 23:58:54
278842  27      29693  2010-08-23 23:58:53
329318  7       655    2010-08-23 23:58:46
277612  15      5802   2010-08-23 23:58:07
384956  5       11843  2010-08-23 23:58:00
386333  12      18894  2010-08-23 23:57:44
201899  7       6180   2010-08-23 23:57:26
260404  14      9398   2010-08-23 23:57:24
471000  6       18012  2010-08-23 23:56:58
451808  26      25758  2010-08-23 23:56:49
172079  2       2379   2010-08-23 23:56:48
367042  11      821    2010-08-23 23:56:39
112653  24      4186   2010-08-23 23:56:13
291683  17      20198  2010-08-23 23:56:12
127839  11      18615  2010-08-23 23:56:01
20 rows fetched in 0.0007s (0.0031s)
id  select_type         table  type   possible_keys           key                     key_len  ref                       rows  filtered    Extra
1   PRIMARY             lf     index                          ix_logfile_ts_id        12                                 20    2500660.00  Using where
2   DEPENDENT SUBQUERY  lfi    ref    ix_logfile_dense_ts_id  ix_logfile_dense_ts_id  4        20100824_latest.lf.dense  8     100.00      Using where; Using index
Field or reference '20100824_latest.lf.dense' of SELECT #2 was resolved in SELECT #1
select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`dense` = `20100824_latest`.`lf`.`dense`) order by `20100824_latest`.`lfi`.`dense` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20

We see that the query is instant again, despite the data distribution being completely different. This is because the query only skips the records which are not the latest for their persons, and the total number of records to scan is determined by how many records we browse before we encounter the 20th distinct value. This value decreases exponentially as the number of distinct persons in the table grows, but with 99% probability it won't exceed 100 records even with only 20 distinct persons in the table.

The only problem arises when the number of distinct persons in the table is less than the LIMIT we set. In this case, the scan can never find enough distinct values to reach the limit, and a full index scan (accompanied by an index seek per record) will ultimately be performed.

To work around this, the following simple query can be run in advance:

SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT sparse
        FROM    logfile
        LIMIT 20
        ) q
COUNT(*)
20
1 row fetched in 0.0001s (0.0015s)


This query will return the actual number of distinct persons in the table if there are fewer than 20, or 20 if there are more.

This query is instant even for the dense data:

SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT dense
        FROM    logfile
        LIMIT 20
        ) q
COUNT(*)
20
1 row fetched in 0.0001s (0.0024s)


This needs to be run as a separate query because MySQL does not allow anything other than constants in the LIMIT clause. The result of this query should be substituted into the LIMIT clause on the client side or in a dynamically composed query on the server.
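On the server side, the dynamic composition could look like this sketch, using a user variable and a prepared statement (the variable names @lim and @sql are arbitrary):

-- Cap the limit at the actual number of distinct persons
SET @lim := (
        SELECT  COUNT(*)
        FROM    (
                SELECT  DISTINCT sparse
                FROM    logfile
                LIMIT 20
                ) q
        );

-- LIMIT only accepts constants in a plain statement,
-- so the query text is composed dynamically and prepared
SET @sql := CONCAT(
        'SELECT id, sparse, dense, ts
        FROM logfile lf
        WHERE id =
        (
        SELECT id
        FROM logfile lfi
        WHERE lfi.sparse = lf.sparse
        ORDER BY sparse DESC, ts DESC, id DESC
        LIMIT 1
        )
        ORDER BY ts DESC, id DESC
        LIMIT ', @lim);

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

The same capped value can, of course, be passed from the client instead.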

Summary

To select a number of latest unique records from a table, one can use aggregate functions; however, this can degrade query performance.

This can be done more efficiently by creating two different indexes on the table and checking the records taken from the general timeline against the end of the index on the person's timeline.

To avoid performance degradation in marginal cases (when the total number of persons in the table is less than the LIMIT), it is possible to make an additional check for the total number of distinct persons and adjust the LIMIT clause if there are not enough of them.

P. S. I decided to enable comments for the technical posts as well. You are welcome to comment.

Written by Quassnoi

August 24th, 2010 at 11:00 pm

Posted in MySQL

3 Responses to '20 latest unique records'


  1. nice!

    I don’t like the idea of creating indexes to make a single query run fast. Also, I do not know about others but I personally don’t get enough time to look at each query in such detail.

    arun

    25 Aug 10 at 23:06

  2. @arun: in your opinion, how many queries does it take to make an index worth creating?

    Quassnoi

    26 Aug 10 at 00:50

  3. This is excellent stuff indeed!

    No hard feeling Arun, but I am just curious to know if you “don’t get enough time to look at each query in such detail”, then what do you think you should be spending your time on? you better play golf rather than mingling with databases!

    vivek

    1 Sep 10 at 11:54
