Latest DISTINCT records: efficient subquery
Comments enabled. I *really* need your comment
In my yesterday's article:
I described an efficient query to select 10 latest distinct IP
's from a pagehit table in MySQL.
Here's the query:
SELECT INET_NTOA(ip), ts FROM t_latest2 lo WHERE NOT EXISTS ( SELECT 1 FROM t_latest2 li WHERE li.ip = lo.ip AND li.ts > lo.ts ) ORDER BY ts DESC LIMIT 10
, which works all right if there are lots of distinct IP
's (as it is in a real table).
However, due to the bug in MySQL, this query is not as efficient as it should be.
Let's create the sample table and see what the problem is:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_latest2 ( id INT NOT NULL PRIMARY KEY, ip INT NOT NULL, ts DATETIME NOT NULL, value VARCHAR(50) NOT NULL, KEY ix_latest2_ts (ts), KEY ix_latest2_ip_ts (ip, ts) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(100000); COMMIT; INSERT INTO t_latest2 (id, ip, ts, value) SELECT id, INET_ATON('10.0.0.0') + FLOOR(RAND(20090609) * 10), CAST('2009-06-09' AS DATETIME) - INTERVAL id SECOND, CONCAT('Record ', id) FROM filler; ANALYZE TABLE t_latest2; [/sourcecode] Just as in the yesterday's article, this table contains <strong>1,000,000</strong> rows with <strong>10</strong> distinct <code>IP</code>'s randomly distributed. Of course, a real pagehit table would have much more distinct <code>IP</code>'s, but I intentionally created so few <code>IP</code>'s to show the point. Now, let's run the query against this table: SELECT DISTINCT INET_NTOA(ip) AS textip, ts FROM t_latest2 lo WHERE NOT EXISTS ( SELECT 1 FROM t_latest2 li WHERE li.ip = lo.ip AND li.ts > lo.ts ) ORDER BY ts DESC LIMIT 10
textip | ts | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10.0.0.1 | 2009-06-08 23:59:59 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.9 | 2009-06-08 23:59:58 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.4 | 2009-06-08 23:59:57 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.0 | 2009-06-08 23:59:56 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.6 | 2009-06-08 23:59:54 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.8 | 2009-06-08 23:59:53 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.3 | 2009-06-08 23:59:51 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.7 | 2009-06-08 23:59:48 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.5 | 2009-06-08 23:59:36 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.2 | 2009-06-08 23:59:34 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 rows fetched in 0.0003s (2.0468s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | lo | index | ix_latest2_ts | 8 | 10 | 10034320.00 | Using where; Using temporary | ||
2 | DEPENDENT SUBQUERY | li | ref | ix_latest2_ts,ix_latest2_ip_ts | ix_latest2_ip_ts | 4 | 20090609_latest.lo.ip | 55746 | 100.00 | Using where; Using index |
As you can see, this query runs for more than 2 seconds.
Why so long?
If we look closer into the query's plan, we will see that the DEPENDENT SUBQUERY
(the one inside the NOT EXISTS
clause) uses ref
access method along with using where
.
What does it mean?
The query takes each latest rows from the table, one by one, and for each row invokes the subquery which searches: are there any rows in the table with same IP
and greater ts
?
If this subquery returns a row, then there had already been a record with same IP
but higher ts
in the recordset, and, therefore, the current row should be rejected.
Since the records are sorted by ts
in descending order, first 10 non-rejected rows will be 10 latest records with distinct IP
's.
Now, the correlated subquery which checks the previous records looks like this:
SELECT 1 FROM t_latest2 li WHERE li.ip = lo.ip AND li.ts > lo.ts
We see that it takes two values (lo.ip
and lo.ts
) from the parent recordset and tries to find a single record satisfying this condition: li.ip = lo.ip AND li.ts > lo.ts
.
Since we have an index on (ip, ts)
, a plain MySQL query would use range
access method to filter on this condition.
Indeed, we need to have a single contiguous range here: (lo.ip, lo.ts) > (ip, ts) > (lo.ip, +Inf)
. Since lo.ts
is most probably very close to the largest ts
in the table, this is a very short range which can be browsed in no time.
Unfortunately, due to bug 19836 in MySQL, range
access method cannot be used by a the correlated subquery.
That's why it uses less efficient ref
access method, which browses all records with li.ip = lo.ip
using the index and filters out the wrong ts
's.
It's not a problem when there are few records for each given IP
, but since we have only 10 distinct IP
's here, there are about 100,000 records with each given IP
, and browsing all these records becomes expensive.
What can we do about this?
We need this subquery to find out if there are any records with same IP
and ts
greater than the current one.
That means we could just sort the records with given IP
by ts DESC
, and see the ts
of record which returns first.
If the returned record has ts
more than the current one, then we'll need to reject the row, else accept it and return.
MySQL allows using index for ORDER BY
inside the query.
Let's try it:
SELECT DISTINCT INET_NTOA(ip) AS textip, ts FROM t_latest2 lo WHERE ts = ( SELECT ts FROM t_latest2 li WHERE li.ip = lo.ip ORDER BY li.IP DESC, li.ts DESC LIMIT 1 ) ORDER BY ts DESC LIMIT 10
textip | ts | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10.0.0.1 | 2009-06-08 23:59:59 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.9 | 2009-06-08 23:59:58 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.4 | 2009-06-08 23:59:57 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.0 | 2009-06-08 23:59:56 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.6 | 2009-06-08 23:59:54 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.8 | 2009-06-08 23:59:53 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.3 | 2009-06-08 23:59:51 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.7 | 2009-06-08 23:59:48 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.5 | 2009-06-08 23:59:36 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10.0.0.2 | 2009-06-08 23:59:34 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 rows fetched in 0.0003s (0.0021s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | lo | index | ix_latest2_ts | 8 | 10 | 10034320.00 | Using where; Using temporary | ||
2 | DEPENDENT SUBQUERY | li | ref | ix_latest2_ip_ts | ix_latest2_ip_ts | 4 | 20090609_latest.lo.ip | 55746 | 100.00 | Using where; Using index |
The ref
access is still used by the subquery.
But since we don't need to make sure that there are no records with greater ts
(which requres browsing all values for the given IP
), but instead to select the first record for the given IP
(which returns as soon as this record is found, which is extra fast), ref
access method is OK here, and the query is instant.
If the IP
's are distributed more or less evenly over the table, this query will suit both cases (few distinct IP
's and lots of distinct IP
's) equally well.