Latest DISTINCT records: efficient subquery at EXPLAIN EXTENDED

Latest DISTINCT records: efficient subquery

Comments enabled. I *really* need your comment

In my yesterday's article:

Latest DISTINCT records

I described an efficient query to select 10 latest distinct IP's from a pagehit table in MySQL.

Here's the query:

SELECT  INET_NTOA(ip), ts
FROM    t_latest2 lo
WHERE   NOT EXISTS (
        SELECT  1
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10

, which works all right if there are lots of distinct IP's (as it is in a real table).

However, due to the bug in MySQL, this query is not as efficient as it should be.

Let's create the sample table and see what the problem is:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_latest2 (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        ts DATETIME NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_latest2_ts (ts),
        KEY ix_latest2_ip_ts (ip, ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_latest2 (id, ip, ts, value)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090609) * 10),
        CAST('2009-06-09' AS DATETIME) - INTERVAL id SECOND,
        CONCAT('Record ', id)
FROM    filler;

ANALYZE TABLE t_latest2;
&#91;/sourcecode&#93;

Just as in the yesterday's article, this table contains <strong>1,000,000</strong> rows with <strong>10</strong> distinct <code>IP</code>'s randomly distributed.

Of course, a real pagehit table would have much more distinct <code>IP</code>'s, but I intentionally created so few <code>IP</code>'s to show the point.

Now, let's run the query against this table:


SELECT  DISTINCT INET_NTOA(ip) AS textip, ts
FROM    t_latest2 lo
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10

textip	ts
10.0.0.1	2009-06-08 23:59:59
10.0.0.9	2009-06-08 23:59:58
10.0.0.4	2009-06-08 23:59:57
10.0.0.0	2009-06-08 23:59:56
10.0.0.6	2009-06-08 23:59:54
10.0.0.8	2009-06-08 23:59:53
10.0.0.3	2009-06-08 23:59:51
10.0.0.7	2009-06-08 23:59:48
10.0.0.5	2009-06-08 23:59:36
10.0.0.2	2009-06-08 23:59:34
10 rows fetched in 0.0003s (2.0468s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	lo	index		ix_latest2_ts	8		10	10034320.00	Using where; Using temporary
2	DEPENDENT SUBQUERY	li	ref	ix_latest2_ts,ix_latest2_ip_ts	ix_latest2_ip_ts	4	20090609_latest.lo.ip	55746	100.00	Using where; Using index

As you can see, this query runs for more than 2 seconds.

Why so long?

If we look closer into the query's plan, we will see that the DEPENDENT SUBQUERY (the one inside the NOT EXISTS clause) uses ref access method along with using where.

What does it mean?

The query takes each latest rows from the table, one by one, and for each row invokes the subquery which searches: are there any rows in the table with same IP and greater ts?

If this subquery returns a row, then there had already been a record with same IP but higher ts in the recordset, and, therefore, the current row should be rejected.

Since the records are sorted by ts in descending order, first 10 non-rejected rows will be 10 latest records with distinct IP's.

Now, the correlated subquery which checks the previous records looks like this:

        SELECT  1
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts

We see that it takes two values (lo.ip and lo.ts) from the parent recordset and tries to find a single record satisfying this condition: li.ip = lo.ip AND li.ts > lo.ts.

Since we have an index on (ip, ts), a plain MySQL query would use range access method to filter on this condition.

Indeed, we need to have a single contiguous range here: (lo.ip, lo.ts) > (ip, ts) > (lo.ip, +Inf). Since lo.ts is most probably very close to the largest ts in the table, this is a very short range which can be browsed in no time.

Unfortunately, due to bug 19836 in MySQL, range access method cannot be used by a the correlated subquery.

That's why it uses less efficient ref access method, which browses all records with li.ip = lo.ip using the index and filters out the wrong ts's.

It's not a problem when there are few records for each given IP, but since we have only 10 distinct IP's here, there are about 100,000 records with each given IP, and browsing all these records becomes expensive.

What can we do about this?

We need this subquery to find out if there are any records with same IP and ts greater than the current one.

That means we could just sort the records with given IP by ts DESC, and see the ts of record which returns first.

If the returned record has ts more than the current one, then we'll need to reject the row, else accept it and return.

MySQL allows using index for ORDER BY inside the query.

Let's try it:

SELECT  DISTINCT INET_NTOA(ip) AS textip, ts
FROM    t_latest2 lo
WHERE   ts =
        (
        SELECT  ts
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
        ORDER BY
                li.IP DESC, li.ts DESC
        LIMIT 1
        )
ORDER BY
        ts DESC
LIMIT 10

textip	ts
10.0.0.1	2009-06-08 23:59:59
10.0.0.9	2009-06-08 23:59:58
10.0.0.4	2009-06-08 23:59:57
10.0.0.0	2009-06-08 23:59:56
10.0.0.6	2009-06-08 23:59:54
10.0.0.8	2009-06-08 23:59:53
10.0.0.3	2009-06-08 23:59:51
10.0.0.7	2009-06-08 23:59:48
10.0.0.5	2009-06-08 23:59:36
10.0.0.2	2009-06-08 23:59:34
10 rows fetched in 0.0003s (0.0021s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	lo	index		ix_latest2_ts	8		10	10034320.00	Using where; Using temporary
2	DEPENDENT SUBQUERY	li	ref	ix_latest2_ip_ts	ix_latest2_ip_ts	4	20090609_latest.lo.ip	55746	100.00	Using where; Using index

The ref access is still used by the subquery.

But since we don't need to make sure that there are no records with greater ts (which requres browsing all values for the given IP), but instead to select the first record for the given IP (which returns as soon as this record is found, which is extra fast), ref access method is OK here, and the query is instant.

If the IP's are distributed more or less evenly over the table, this query will suit both cases (few distinct IP's and lots of distinct IP's) equally well.

Written by Quassnoi

June 9th, 2009 at 11:00 pm

Posted in MySQL

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

EXPLAIN EXTENDED