Latest distinct IP's ordered by time at EXPLAIN EXTENDED

I have a table containing pagehit data in MySQL.

How do I grab the 10 latest unique IP's from that table?

This is quite a common task and there are two approaches to solve it.

Let's create two sample tables that differ in IP distribution:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_latest (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        ts DATETIME NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_latest_ts (ts),
        KEY ix_latest_ip_ts (ip, ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t_latest2 (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        ts DATETIME NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_latest2_ts (ts),
        KEY ix_latest2_ip_ts (ip, ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_latest (id, ip, ts, value)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090608) * 0x1000),
        CAST('2009-06-08' AS DATETIME) - INTERVAL id SECOND,
        CONCAT('Record ', id)
FROM    filler;

INSERT
INTO    t_latest2 (id, ip, ts, value)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090608) * 10),
        CAST('2009-06-08' AS DATETIME) - INTERVAL id SECOND,
        CONCAT('Record ', id)
FROM    filler;
&#91;/sourcecode&#93;

These two tables, <code>t_latest</code> and <code>t_latest2</code>, contain <strong>100,000</strong> rows each.

In <code>t_latest</code>, <code>IP</code>'s are distributed randomly with <strong>65,536</strong> possible values; in <code>t_latest2</code> same but with <strong>10</strong> possible values.

In both tables, there are indexes on <code>(ts)</code> and on <code>(ip, ts)</code>.

Now, how do we query for <strong>10</strong> latest distinct <code>IP</code>'s?

We have two options here:

<ol>
	<li>Take the time of the latest entry for each <code>IP</code>, then order the resultset by time in descending order and take first <strong>10</strong> values
</li>
	<li>Order all values by <code>ts DESC</code>, and take first <strong>10</strong> values that don't have earlier entries with same <code>IP</code></li>
</ol>

To implement solution <strong>1</strong>, we can just select <code>MAX(ts)</code> for each <code>ip</code>, order by <code>MAX(ts)</code> and take first <strong>10</strong> values using a <code>LIMIT</code> clause.

Due to the <a href="http://bugs.mysql.com/bug.php?id=45373"><strong>bug 45373</strong></a> in <strong>MySQL</strong>, we will need to wrap the <code>GROUP BY</code> into a subquery for efficient <code>INDEX FOR GROUP BY</code> method to be usable.

Here's the query on the table with few <code>IP</code>'s:


SELECT  *
FROM    (
        SELECT  INET_NTOA(ip) AS textip, MAX(ts) AS maxts
        FROM    t_latest2
        GROUP BY
                ip
        ) q
ORDER BY
        maxts DESC
LIMIT 10

textip	maxts
10.0.0.9	2009-06-07 23:59:59
10.0.0.7	2009-06-07 23:59:58
10.0.0.3	2009-06-07 23:59:56
10.0.0.8	2009-06-07 23:59:55
10.0.0.4	2009-06-07 23:59:54
10.0.0.5	2009-06-07 23:59:53
10.0.0.1	2009-06-07 23:59:51
10.0.0.6	2009-06-07 23:59:40
10.0.0.2	2009-06-07 23:59:38
10.0.0.0	2009-06-07 23:59:36
10 rows fetched in 0.0003s (0.0017s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					10	100.00	Using filesort
2	DERIVED	t_latest2	range		ix_latest2_ip_ts	4		19	100.00	Using index for group-by

Same query on the table with lots of IP's:

SELECT  *
FROM    (
        SELECT  INET_NTOA(ip) AS textip, MAX(ts) AS maxts
        FROM    t_latest
        GROUP BY
                ip
        ) q
ORDER BY
        maxts DESC
LIMIT 10

textip	maxts
10.0.238.37	2009-06-07 23:59:59
10.0.189.96	2009-06-07 23:59:58
10.0.232.111	2009-06-07 23:59:57
10.0.82.15	2009-06-07 23:59:56
10.0.224.254	2009-06-07 23:59:55
10.0.110.202	2009-06-07 23:59:54
10.0.134.250	2009-06-07 23:59:53
10.0.86.129	2009-06-07 23:59:52
10.0.27.154	2009-06-07 23:59:51
10.0.134.128	2009-06-07 23:59:50
10 rows fetched in 0.0003s (1.8750s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					65536	100.00	Using filesort
2	DERIVED	t_latest	range		ix_latest_ip_ts	4		142943	100.00	Using index for group-by

As you can see, this query is instant for the table with few distinct IP's but takes almost 2 seconds when there are lots of IP's.

The reason is obvious: the latter query needs to calculate and sort much more aggregates.

Each aggregate is almost instant, since it uses a single index lookup to find out the MAX value, so the number of aggregated records in each group doesn't influence the query time.

It's only the number of groups that counts, and it's huge for the second query. since there are lots of distinct IP's.

For the latter table, it's better to use the solution 2. We need to sort the records by ts DESC, and for each record we will check that there are no records with same IP but greater ts.

This guarantees that for each IP, only the highest ts will be selected.

Of course, the duplicates are possible, but they can be easily removed by using DISTINCT clause.

Here's the query for the table with few IP's:

SELECT  DISTINCT INET_NTOA(ip) AS textip, ts
FROM    t_latest2 lo
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10

textip	ts
10.0.0.9	2009-06-07 23:59:59
10.0.0.7	2009-06-07 23:59:58
10.0.0.3	2009-06-07 23:59:56
10.0.0.8	2009-06-07 23:59:55
10.0.0.4	2009-06-07 23:59:54
10.0.0.5	2009-06-07 23:59:53
10.0.0.1	2009-06-07 23:59:51
10.0.0.6	2009-06-07 23:59:40
10.0.0.2	2009-06-07 23:59:38
10.0.0.0	2009-06-07 23:59:36
10 rows fetched in 0.0003s (1.8906s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	lo	index		ix_latest2_ts	8		10	9979060.00	Using where; Using temporary
2	DEPENDENT SUBQUERY	li	ref	ix_latest2_ts,ix_latest2_ip_ts	ix_latest2_ip_ts	4	20090608_latest.lo.ip	55439	100.00	Using where; Using index

, and same for the table with lots of IP's:

SELECT  DISTINCT INET_NTOA(ip) AS textip, ts
FROM    t_latest lo
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    t_latest li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10

textip	ts
10.0.238.37	2009-06-07 23:59:59
10.0.189.96	2009-06-07 23:59:58
10.0.232.111	2009-06-07 23:59:57
10.0.82.15	2009-06-07 23:59:56
10.0.224.254	2009-06-07 23:59:55
10.0.110.202	2009-06-07 23:59:54
10.0.134.250	2009-06-07 23:59:53
10.0.86.129	2009-06-07 23:59:52
10.0.27.154	2009-06-07 23:59:51
10.0.134.128	2009-06-07 23:59:50
10 rows fetched in 0.0003s (0.0019s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	lo	index		ix_latest_ts	8		10	10005940.00	Using where; Using temporary
2	DEPENDENT SUBQUERY	li	ref	ix_latest_ts,ix_latest_ip_ts	ix_latest_ip_ts	4	20090608_latest.lo.ip	7	100.00	Using where; Using index

The situation is diametrically opposite to that of the GROUP BY queries: the query is very slow for the table with few IP's and is almost instant when there are lots of IP's.

This is because for this query, the EXISTS subquery is executed for each IP met so far and involves an index scan over all records for this IP.

If there are few IP's, there are lots of records for each of these IP's, and the subquery takes a long time to complete.

If there are lot's of IP's, each subquery is almost instant.

And since there is a very low probablility of meeting the same IP twice, all 10 records most probably will be selected in 10 iterations.

But which approach should be used?

Since real world IP's are quite numerous, in a real pagehit table I'd use the second approach (with EXISTS).

However, it will only be efficient if there is an index on both (ts) and (ip, ts).

Written by Quassnoi

June 8th, 2009 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED

Latest DISTINCT records

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow