Latest DISTINCT records: efficient subquery
Comments enabled. I *really* need your comment
In my yesterday's article:
I described an efficient query to select 10 latest distinct IP's from a pagehit table in MySQL.
Here's the query:
SELECT INET_NTOA(ip), ts
FROM t_latest2 lo
WHERE NOT EXISTS (
SELECT 1
FROM t_latest2 li
WHERE li.ip = lo.ip
AND li.ts > lo.ts
)
ORDER BY
ts DESC
LIMIT 10
, which works all right if there are lots of distinct IP's (as it is in a real table).
However, due to the bug in MySQL, this query is not as efficient as it should be.
Let's create the sample table and see what the problem is:
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE t_latest2 (
id INT NOT NULL PRIMARY KEY,
ip INT NOT NULL,
ts DATETIME NOT NULL,
value VARCHAR(50) NOT NULL,
KEY ix_latest2_ts (ts),
KEY ix_latest2_ip_ts (ip, ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(100000);
COMMIT;
INSERT
INTO t_latest2 (id, ip, ts, value)
SELECT id,
INET_ATON('10.0.0.0') + FLOOR(RAND(20090609) * 10),
CAST('2009-06-09' AS DATETIME) - INTERVAL id SECOND,
CONCAT('Record ', id)
FROM filler;
ANALYZE TABLE t_latest2;
[/sourcecode]
Just as in the yesterday's article, this table contains <strong>1,000,000</strong> rows with <strong>10</strong> distinct <code>IP</code>'s randomly distributed.
Of course, a real pagehit table would have much more distinct <code>IP</code>'s, but I intentionally created so few <code>IP</code>'s to show the point.
Now, let's run the query against this table:
SELECT DISTINCT INET_NTOA(ip) AS textip, ts
FROM t_latest2 lo
WHERE NOT EXISTS
(
SELECT 1
FROM t_latest2 li
WHERE li.ip = lo.ip
AND li.ts > lo.ts
)
ORDER BY
ts DESC
LIMIT 10
| textip | ts | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10.0.0.1 | 2009-06-08 23:59:59 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.9 | 2009-06-08 23:59:58 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.4 | 2009-06-08 23:59:57 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.0 | 2009-06-08 23:59:56 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.6 | 2009-06-08 23:59:54 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.8 | 2009-06-08 23:59:53 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.3 | 2009-06-08 23:59:51 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.7 | 2009-06-08 23:59:48 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.5 | 2009-06-08 23:59:36 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.2 | 2009-06-08 23:59:34 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 rows fetched in 0.0003s (2.0468s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | lo | index | ix_latest2_ts | 8 | 10 | 10034320.00 | Using where; Using temporary | ||
| 2 | DEPENDENT SUBQUERY | li | ref | ix_latest2_ts,ix_latest2_ip_ts | ix_latest2_ip_ts | 4 | 20090609_latest.lo.ip | 55746 | 100.00 | Using where; Using index |
As you can see, this query runs for more than 2 seconds.
Why so long?
If we look closer into the query's plan, we will see that the DEPENDENT SUBQUERY (the one inside the NOT EXISTS clause) uses ref access method along with using where.
What does it mean?
The query takes each latest rows from the table, one by one, and for each row invokes the subquery which searches: are there any rows in the table with same IP and greater ts?
If this subquery returns a row, then there had already been a record with same IP but higher ts in the recordset, and, therefore, the current row should be rejected.
Since the records are sorted by ts in descending order, first 10 non-rejected rows will be 10 latest records with distinct IP's.
Now, the correlated subquery which checks the previous records looks like this:
SELECT 1
FROM t_latest2 li
WHERE li.ip = lo.ip
AND li.ts > lo.ts
We see that it takes two values (lo.ip and lo.ts) from the parent recordset and tries to find a single record satisfying this condition: li.ip = lo.ip AND li.ts > lo.ts.
Since we have an index on (ip, ts), a plain MySQL query would use range access method to filter on this condition.
Indeed, we need to have a single contiguous range here: (lo.ip, lo.ts) > (ip, ts) > (lo.ip, +Inf). Since lo.ts is most probably very close to the largest ts in the table, this is a very short range which can be browsed in no time.
Unfortunately, due to bug 19836 in MySQL, range access method cannot be used by a the correlated subquery.
That's why it uses less efficient ref access method, which browses all records with li.ip = lo.ip using the index and filters out the wrong ts's.
It's not a problem when there are few records for each given IP, but since we have only 10 distinct IP's here, there are about 100,000 records with each given IP, and browsing all these records becomes expensive.
What can we do about this?
We need this subquery to find out if there are any records with same IP and ts greater than the current one.
That means we could just sort the records with given IP by ts DESC, and see the ts of record which returns first.
If the returned record has ts more than the current one, then we'll need to reject the row, else accept it and return.
MySQL allows using index for ORDER BY inside the query.
Let's try it:
SELECT DISTINCT INET_NTOA(ip) AS textip, ts
FROM t_latest2 lo
WHERE ts =
(
SELECT ts
FROM t_latest2 li
WHERE li.ip = lo.ip
ORDER BY
li.IP DESC, li.ts DESC
LIMIT 1
)
ORDER BY
ts DESC
LIMIT 10
| textip | ts | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10.0.0.1 | 2009-06-08 23:59:59 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.9 | 2009-06-08 23:59:58 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.4 | 2009-06-08 23:59:57 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.0 | 2009-06-08 23:59:56 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.6 | 2009-06-08 23:59:54 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.8 | 2009-06-08 23:59:53 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.3 | 2009-06-08 23:59:51 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.7 | 2009-06-08 23:59:48 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.5 | 2009-06-08 23:59:36 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10.0.0.2 | 2009-06-08 23:59:34 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 rows fetched in 0.0003s (0.0021s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | lo | index | ix_latest2_ts | 8 | 10 | 10034320.00 | Using where; Using temporary | ||
| 2 | DEPENDENT SUBQUERY | li | ref | ix_latest2_ip_ts | ix_latest2_ip_ts | 4 | 20090609_latest.lo.ip | 55746 | 100.00 | Using where; Using index |
The ref access is still used by the subquery.
But since we don't need to make sure that there are no records with greater ts (which requres browsing all values for the given IP), but instead to select the first record for the given IP (which returns as soon as this record is found, which is extra fast), ref access method is OK here, and the query is instant.
If the IP's are distributed more or less evenly over the table, this query will suit both cases (few distinct IP's and lots of distinct IP's) equally well.
Subscribe in a reader