EXPLAIN EXTENDED

How to create fast database queries

Latest DISTINCT records

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table containing pagehit data in MySQL.

How do I grab the 10 latest unique IP's from that table?

This is quite a common task and there are two approaches to solve it.

Let's create two sample tables that differ in IP distribution:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_latest (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        ts DATETIME NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_latest_ts (ts),
        KEY ix_latest_ip_ts (ip, ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t_latest2 (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        ts DATETIME NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_latest2_ts (ts),
        KEY ix_latest2_ip_ts (ip, ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_latest (id, ip, ts, value)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090608) * 0x1000),
        CAST('2009-06-08' AS DATETIME) - INTERVAL id SECOND,
        CONCAT('Record ', id)
FROM    filler;

INSERT
INTO    t_latest2 (id, ip, ts, value)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090608) * 10),
        CAST('2009-06-08' AS DATETIME) - INTERVAL id SECOND,
        CONCAT('Record ', id)
FROM    filler;
&#91;/sourcecode&#93;

These two tables, <code>t_latest</code> and <code>t_latest2</code>, contain <strong>100,000</strong> rows each.

In <code>t_latest</code>, <code>IP</code>'s are distributed randomly with <strong>65,536</strong> possible values; in <code>t_latest2</code> same but with <strong>10</strong> possible values.

In both tables, there are indexes on <code>(ts)</code> and on <code>(ip, ts)</code>.

Now, how do we query for <strong>10</strong> latest distinct <code>IP</code>'s?

We have two options here:

<ol>
	<li>Take the time of the latest entry for each <code>IP</code>, then order the resultset by time in descending order and take first <strong>10</strong> values
</li>
	<li>Order all values by <code>ts DESC</code>, and take first <strong>10</strong> values that don't have earlier entries with same <code>IP</code></li>
</ol>

To implement solution <strong>1</strong>, we can just select <code>MAX(ts)</code> for each <code>ip</code>, order by <code>MAX(ts)</code> and take first <strong>10</strong> values using a <code>LIMIT</code> clause.

Due to the <a href="http://bugs.mysql.com/bug.php?id=45373"><strong>bug 45373</strong></a> in <strong>MySQL</strong>, we will need to wrap the <code>GROUP BY</code> into a subquery for efficient <code>INDEX FOR GROUP BY</code> method to be usable.

Here's the query on the table with few <code>IP</code>'s:


SELECT  *
FROM    (
        SELECT  INET_NTOA(ip) AS textip, MAX(ts) AS maxts
        FROM    t_latest2
        GROUP BY
                ip
        ) q
ORDER BY
        maxts DESC
LIMIT 10

textip maxts
10.0.0.9 2009-06-07 23:59:59
10.0.0.7 2009-06-07 23:59:58
10.0.0.3 2009-06-07 23:59:56
10.0.0.8 2009-06-07 23:59:55
10.0.0.4 2009-06-07 23:59:54
10.0.0.5 2009-06-07 23:59:53
10.0.0.1 2009-06-07 23:59:51
10.0.0.6 2009-06-07 23:59:40
10.0.0.2 2009-06-07 23:59:38
10.0.0.0 2009-06-07 23:59:36
10 rows fetched in 0.0003s (0.0017s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 10 100.00 Using filesort
2 DERIVED t_latest2 range ix_latest2_ip_ts 4 19 100.00 Using index for group-by

Same query on the table with lots of IP's:

SELECT  *
FROM    (
        SELECT  INET_NTOA(ip) AS textip, MAX(ts) AS maxts
        FROM    t_latest
        GROUP BY
                ip
        ) q
ORDER BY
        maxts DESC
LIMIT 10

textip maxts
10.0.238.37 2009-06-07 23:59:59
10.0.189.96 2009-06-07 23:59:58
10.0.232.111 2009-06-07 23:59:57
10.0.82.15 2009-06-07 23:59:56
10.0.224.254 2009-06-07 23:59:55
10.0.110.202 2009-06-07 23:59:54
10.0.134.250 2009-06-07 23:59:53
10.0.86.129 2009-06-07 23:59:52
10.0.27.154 2009-06-07 23:59:51
10.0.134.128 2009-06-07 23:59:50
10 rows fetched in 0.0003s (1.8750s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 65536 100.00 Using filesort
2 DERIVED t_latest range ix_latest_ip_ts 4 142943 100.00 Using index for group-by

As you can see, this query is instant for the table with few distinct IP's but takes almost 2 seconds when there are lots of IP's.

The reason is obvious: the latter query needs to calculate and sort much more aggregates.

Each aggregate is almost instant, since it uses a single index lookup to find out the MAX value, so the number of aggregated records in each group doesn't influence the query time.

It's only the number of groups that counts, and it's huge for the second query. since there are lots of distinct IP's.

For the latter table, it's better to use the solution 2. We need to sort the records by ts DESC, and for each record we will check that there are no records with same IP but greater ts.

This guarantees that for each IP, only the highest ts will be selected.

Of course, the duplicates are possible, but they can be easily removed by using DISTINCT clause.

Here's the query for the table with few IP's:

SELECT  DISTINCT INET_NTOA(ip) AS textip, ts
FROM    t_latest2 lo
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10
textip ts
10.0.0.9 2009-06-07 23:59:59
10.0.0.7 2009-06-07 23:59:58
10.0.0.3 2009-06-07 23:59:56
10.0.0.8 2009-06-07 23:59:55
10.0.0.4 2009-06-07 23:59:54
10.0.0.5 2009-06-07 23:59:53
10.0.0.1 2009-06-07 23:59:51
10.0.0.6 2009-06-07 23:59:40
10.0.0.2 2009-06-07 23:59:38
10.0.0.0 2009-06-07 23:59:36
10 rows fetched in 0.0003s (1.8906s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY lo index ix_latest2_ts 8 10 9979060.00 Using where; Using temporary
2 DEPENDENT SUBQUERY li ref ix_latest2_ts,ix_latest2_ip_ts ix_latest2_ip_ts 4 20090608_latest.lo.ip 55439 100.00 Using where; Using index

, and same for the table with lots of IP's:

SELECT  DISTINCT INET_NTOA(ip) AS textip, ts
FROM    t_latest lo
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    t_latest li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10
textip ts
10.0.238.37 2009-06-07 23:59:59
10.0.189.96 2009-06-07 23:59:58
10.0.232.111 2009-06-07 23:59:57
10.0.82.15 2009-06-07 23:59:56
10.0.224.254 2009-06-07 23:59:55
10.0.110.202 2009-06-07 23:59:54
10.0.134.250 2009-06-07 23:59:53
10.0.86.129 2009-06-07 23:59:52
10.0.27.154 2009-06-07 23:59:51
10.0.134.128 2009-06-07 23:59:50
10 rows fetched in 0.0003s (0.0019s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY lo index ix_latest_ts 8 10 10005940.00 Using where; Using temporary
2 DEPENDENT SUBQUERY li ref ix_latest_ts,ix_latest_ip_ts ix_latest_ip_ts 4 20090608_latest.lo.ip 7 100.00 Using where; Using index

The situation is diametrically opposite to that of the GROUP BY queries: the query is very slow for the table with few IP's and is almost instant when there are lots of IP's.

This is because for this query, the EXISTS subquery is executed for each IP met so far and involves an index scan over all records for this IP.

If there are few IP's, there are lots of records for each of these IP's, and the subquery takes a long time to complete.

If there are lot's of IP's, each subquery is almost instant.

And since there is a very low probablility of meeting the same IP twice, all 10 records most probably will be selected in 10 iterations.

But which approach should be used?

Since real world IP's are quite numerous, in a real pagehit table I'd use the second approach (with EXISTS).

However, it will only be efficient if there is an index on both (ts) and (ip, ts).

Written by Quassnoi

June 8th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply