20 latest unique records
From Stack Overflow:
I have a logfile which logs the insert/delete/updates from all kinds of tables.
I would like to get an overview of, for example, the last 20 people whose records were updated, ordered by the last update (datetime DESC).
A common solution for such a task would be writing an aggregate query with ORDER BY and LIMIT:
SELECT person, MAX(ts) AS last_update FROM logfile GROUP BY person ORDER BY last_update DESC LIMIT 20
What's bad about this solution? Performance, as usual.
Since last_update is an aggregate, it cannot be indexed, and ORDER BY on unindexed fields brings in our good old friend, filesort.
Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on (person, ts), MySQL will tend to use a loose index scan on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), a loose index scan can even degrade performance, and MySQL will generally avoid it.
We should use another approach here. Let's create a sample table and test this approach:
Table creation details
This table has 1,000,000 records.
Instead of a single field, person, I created two different fields: sparse and dense. The first one has 30 distinct values, while the second one has 30,000. This will help us see how data distribution affects the performance of different queries.
Let's run our original queries. We'll adjust them a little to help MySQL pick the correct plans:
SELECT * FROM ( SELECT sparse, MAX(ts) AS last_update FROM logfile GROUP BY sparse ) q ORDER BY last_update DESC LIMIT 20;
SELECT * FROM ( SELECT dense, MAX(ts) AS last_update FROM logfile GROUP BY dense ) q ORDER BY last_update DESC LIMIT 20;
We see that both queries use the same plan and return 20 records, but the first one is instant, while the second one runs for 500 ms. Both queries use filesort, but in the second case it has to sort 30,000 records (compared to 30 in the first).
In this case, it is better to use another approach.
With our original query, we take each person and see which record is the latest for this person. But we can just as well do it the other way round: take the records in descending order, one by one, and for each record see if it's the latest for its person. If it is, we return it; if it's not, the record for this person has already been returned (remember, we take them in descending order).
It's easy to see that 20 records returned this way will, first, belong to 20 different people, and, second, be the latest records of their respective persons. This is exactly what we need.
The records can easily be scanned in descending order using the index on (ts, id). But how do we check if a record is the latest? It's simple: we just take the last record for the given person from the index on (person, ts, id) and compare its id. This takes but a single index seek per record and is almost instant.
Here's the query to do it:
SELECT id, sparse, dense, ts FROM logfile lf WHERE id = ( SELECT id FROM logfile lfi WHERE lfi.sparse = lf.sparse ORDER BY sparse DESC, ts DESC, id DESC LIMIT 1 ) ORDER BY ts DESC, id DESC LIMIT 20
| id | sparse | dense | ts |
|---|---|---|---|
| 121946 | 15 | 25324 | 2010-08-23 23:59:58 |
| 276499 | 11 | 3268 | 2010-08-23 23:59:56 |
| 62419 | 26 | 13060 | 2010-08-23 23:59:56 |
| 254750 | 30 | 2327 | 2010-08-23 23:59:42 |
| 96079 | 29 | 23968 | 2010-08-23 23:59:32 |
| 290657 | 13 | 1622 | 2010-08-23 23:58:54 |
| 278842 | 27 | 29693 | 2010-08-23 23:58:53 |
| 329318 | 7 | 655 | 2010-08-23 23:58:46 |
| 384956 | 5 | 11843 | 2010-08-23 23:58:00 |
| 386333 | 12 | 18894 | 2010-08-23 23:57:44 |
| 260404 | 14 | 9398 | 2010-08-23 23:57:24 |
| 471000 | 6 | 18012 | 2010-08-23 23:56:58 |
| 172079 | 2 | 2379 | 2010-08-23 23:56:48 |
| 112653 | 24 | 4186 | 2010-08-23 23:56:13 |
| 291683 | 17 | 20198 | 2010-08-23 23:56:12 |
| 144673 | 23 | 25055 | 2010-08-23 23:55:08 |
| 172118 | 19 | 29039 | 2010-08-23 23:55:07 |
| 198913 | 20 | 9887 | 2010-08-23 23:53:44 |
| 491436 | 10 | 17752 | 2010-08-23 23:51:52 |
| 346651 | 4 | 10951 | 2010-08-23 23:50:53 |

20 rows fetched in 0.0007s (0.0034s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | lf | index | | ix_logfile_ts_id | 12 | | 20 | 2500660.00 | Using where |
| 2 | DEPENDENT SUBQUERY | lfi | ref | ix_logfile_sparse_ts_id | ix_logfile_sparse_ts_id | 4 | 20100824_latest.lf.sparse | 27785 | 100.00 | Using where; Using index |

Field or reference '20100824_latest.lf.sparse' of SELECT #2 was resolved in SELECT #1

select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`sparse` = `20100824_latest`.`lf`.`sparse`) order by `20100824_latest`.`lfi`.`sparse` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
As we can see, this query uses two different indexes. The first one, on (ts, id), is used to scan all records along the overall timeline; the second one, on (sparse, ts, id), is used to find the id of the latest entry for a person and check whether it's the same as the record selected from the general timeline.
The query is instant: 3 ms.
Let's check the same query on a column with lots of values:
SELECT id, sparse, dense, ts FROM logfile lf WHERE id = ( SELECT id FROM logfile lfi WHERE lfi.dense = lf.dense ORDER BY dense DESC, ts DESC, id DESC LIMIT 1 ) ORDER BY ts DESC, id DESC LIMIT 20
| id | sparse | dense | ts |
|---|---|---|---|
| 121946 | 15 | 25324 | 2010-08-23 23:59:58 |
| 276499 | 11 | 3268 | 2010-08-23 23:59:56 |
| 62419 | 26 | 13060 | 2010-08-23 23:59:56 |
| 254750 | 30 | 2327 | 2010-08-23 23:59:42 |
| 96079 | 29 | 23968 | 2010-08-23 23:59:32 |
| 290657 | 13 | 1622 | 2010-08-23 23:58:54 |
| 278842 | 27 | 29693 | 2010-08-23 23:58:53 |
| 329318 | 7 | 655 | 2010-08-23 23:58:46 |
| 277612 | 15 | 5802 | 2010-08-23 23:58:07 |
| 384956 | 5 | 11843 | 2010-08-23 23:58:00 |
| 386333 | 12 | 18894 | 2010-08-23 23:57:44 |
| 201899 | 7 | 6180 | 2010-08-23 23:57:26 |
| 260404 | 14 | 9398 | 2010-08-23 23:57:24 |
| 471000 | 6 | 18012 | 2010-08-23 23:56:58 |
| 451808 | 26 | 25758 | 2010-08-23 23:56:49 |
| 172079 | 2 | 2379 | 2010-08-23 23:56:48 |
| 367042 | 11 | 821 | 2010-08-23 23:56:39 |
| 112653 | 24 | 4186 | 2010-08-23 23:56:13 |
| 291683 | 17 | 20198 | 2010-08-23 23:56:12 |
| 127839 | 11 | 18615 | 2010-08-23 23:56:01 |

20 rows fetched in 0.0007s (0.0031s)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | lf | index | | ix_logfile_ts_id | 12 | | 20 | 2500660.00 | Using where |
| 2 | DEPENDENT SUBQUERY | lfi | ref | ix_logfile_dense_ts_id | ix_logfile_dense_ts_id | 4 | 20100824_latest.lf.dense | 8 | 100.00 | Using where; Using index |

Field or reference '20100824_latest.lf.dense' of SELECT #2 was resolved in SELECT #1

select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`dense` = `20100824_latest`.`lf`.`dense`) order by `20100824_latest`.`lfi`.`dense` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
We see that the query is instant again, despite the data distribution being completely different. This is because the query only skips the records which are not the latest for their persons, and the total number of records to scan is defined by how many records we browse before we encounter the 20th distinct value in our scan. This number drops quickly as the number of distinct persons in the table grows, and with 99% probability it won't exceed 100 records even with only 20 distinct persons in the table.
The only problem that can arise here is when the number of distinct persons in the table is less than the LIMIT we set. In this case, no new records can be returned once all persons have been seen, and a full index scan (accompanied by an index seek per record) will ultimately be performed.
To work around this, the following simple query can be run in advance:
SELECT COUNT(*) FROM ( SELECT DISTINCT sparse FROM logfile LIMIT 20 ) q
| COUNT(*) |
|---|
| 20 |

1 row fetched in 0.0001s (0.0015s)
This query will return the actual number of distinct persons in the table if there are fewer than 20 (or 20 if there are more).
This query is instant even for the dense data:
SELECT COUNT(*) FROM ( SELECT DISTINCT dense FROM logfile LIMIT 20 ) q
| COUNT(*) |
|---|
| 20 |

1 row fetched in 0.0001s (0.0024s)
This needs to be run as a separate query because MySQL does not allow anything other than constants in the LIMIT clause. The result of this query should be substituted into the LIMIT clause on the client side or in a dynamically composed query on the server.
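The server-side variant could be sketched with prepared statements. This is an assumption on my part rather than the article's own code: MySQL (5.0.7 and later) does allow a placeholder in the LIMIT clause of a server-side prepared statement, so the count can be fed in through a user variable:

```sql
-- Sketch: count the distinct persons first (capped at 20)...
SELECT  COUNT(*)
INTO    @lim
FROM    (
        SELECT  DISTINCT sparse
        FROM    logfile
        LIMIT 20
        ) q;

-- ...then use that count as the LIMIT via a prepared statement,
-- where a ? placeholder in LIMIT is legal.
PREPARE latest FROM
        'SELECT  id, sparse, dense, ts
         FROM    logfile lf
         WHERE   id =
                 (
                 SELECT  id
                 FROM    logfile lfi
                 WHERE   lfi.sparse = lf.sparse
                 ORDER BY sparse DESC, ts DESC, id DESC
                 LIMIT 1
                 )
         ORDER BY ts DESC, id DESC
         LIMIT ?';

EXECUTE latest USING @lim;
DEALLOCATE PREPARE latest;
```

On the client side, the equivalent is simply running the COUNT query first and interpolating its result into the main query's LIMIT.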
Summary
To select a number of latest unique records from a table, one can use aggregate functions; however, this can degrade query performance.
This can be done more efficiently by creating two different indexes on the table and checking the records taken from the general timeline against the end of the index on the person's timeline.
To avoid performance degradation in marginal cases (when the total number of persons in the table is less than the LIMIT), it is possible to make an additional check for the number of distinct persons and adjust the LIMIT clause if there are not enough of them.
P. S. I decided to enable comments for the technical posts as well. You are welcome to comment.
nice!
I don’t like the idea of creating indexes to make a single query run fast. Also, I do not know about others but I personally don’t get enough time to look at each query in such detail.
arun
25 Aug 10 at 23:06
@arun: in your opinion, how many queries does it take to make an index worth creating?
Quassnoi
26 Aug 10 at 00:50
This is excellent stuff indeed!
No hard feeling Arun, but I am just curious to know if you “don’t get enough time to look at each query in such detail”, then what do you think you should be spending your time on? you better play golf rather than mingling with databases!
vivek
1 Sep 10 at 11:54