EXPLAIN EXTENDED

How to create fast database queries

Archive for April 26th, 2009

Keeping latest rows for a group

with one comment

From Stack Overflow:

Assume I have a table foo where I have something like this:

id, user_id, timestamp, some_value

What I want to do is remove all rows that aren't the newest N per user.

The deletion itself could be handled by a query like this:

DELETE
FROM    foo
WHERE   id NOT IN (...)

so you could rephrase the problem into this: how do I get the newest N (there might be less) rows for each user.

This means if I have U users I may end up with N×U rows, so LIMIT won't really work.

Unfortunately, NOT IN will not work here. MySQL doesn't allow using the target table in an IN or NOT IN clause in a DELETE or UPDATE statement.

But some time ago I posted two solutions:

Now it's time to use these solutions together.

What we need to do here is:

  1. Select DISTINCT user_id's from the table
  2. In a subquery, SELECT n'th timestamp and n'th id for each user_id. We need to select both these fields to handle the possible duplicates of timestamp's. Unfortunately, MySQL doesn't allow multiple columns in a subquery used in a SELECT clause, that's why we need two subselects (ugly, I know)
  3. Join the resultset with original table, for each user_id selecting all rows with (timestamp, id) lesser than the (n'th timestamp, n'th id) selected above
  4. DELETE from the joined table. Unlike IN and NOT IN conditions, using the target table in a JOIN is OK for MySQL

Now, let's create the sample table and see how it works:

Table creation details

This table has 10,000 records for 5 users.

Now, let's try to delete all records except the latest 3 for each user_id:

DELETE  l.*
FROM    foo l
JOIN    (
SELECT  user_id,
COALESCE(
(
SELECT  timestamp
FROM    foo li
WHERE   li.user_id = dlo.user_id
ORDER BY
li.user_id DESC, li.timestamp DESC
LIMIT 2, 1
), CAST('0001-01-01' AS DATETIME)) AS mts,
COALESCE(
(
SELECT  id
FROM    foo li
WHERE   li.user_id = dlo.user_id
ORDER BY
li.user_id DESC, li.timestamp DESC, li.id DESC
LIMIT 2, 1
), -1) AS mid
FROM    (
SELECT  DISTINCT user_id
FROM    foo dl
) dlo
) lo
ON      l.user_id = lo.user_id
AND (l.timestamp, l.id) < (mts, mid)

We assume here that there are no records dated earlier than 1 AD, Jan 1 and no negative id's

Let's see what's left after the query finishes:

SELECT  *
FROM    foo
ORDER BY
user_id, timestamp, id
id user_id timestamp some_value
5950 1 2009-04-25 23:56:41 Value 5950
935 1 2009-04-25 23:58:52 Value 935
95 1 2009-04-25 23:59:07 Value 95
8461 2 2009-04-25 23:57:18 Value 8461
4631 2 2009-04-25 23:58:14 Value 4631
6626 2 2009-04-25 23:59:10 Value 6626
7487 3 2009-04-25 23:55:41 Value 7487
147 3 2009-04-25 23:56:08 Value 147
27 3 2009-04-25 23:59:49 Value 27
2698 4 2009-04-25 23:59:17 Value 2698
1868 4 2009-04-25 23:59:40 Value 1868
7938 4 2009-04-25 23:59:56 Value 7938
8929 5 2009-04-25 23:57:59 Value 8929
1934 5 2009-04-25 23:58:19 Value 1934
6179 5 2009-04-25 23:58:22 Value 6179
15 rows fetched in 0.0005s (0.0020s)

Just like requested: 15 records left, 3 latest records for each user_id.

Written by Quassnoi

April 26th, 2009 at 11:00 pm

Posted in MySQL