Counting bans
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a MyISAM table in MySQL 5.0 like this:
id login ip banned I would like to find all users not banned (
banned = 0) if at least 5 other users with the sameiphave already been banned (banned = 1).
To do this, we should first find which IP's have more than 5 bans and then find all non-banned users for these IP's.
To find the IP's we can use two methods:
- Find all banned
IP's, group them and filter out those that haveCOUNT(*) < 5 - For each
IPpossible, find if there are 5 banned records for thisIPusing aLIMITclause
The first method doesn't require doing an extra JOIN on the table, but it will do count all values just to figure out if the COUNT(*) is more or less than 5.
The second method stops counting whenever it find the 5th value, but it requires an extra JOIN.
Counting is bad if there are many values to count, and the extra join is bad if there are many JOIN loops.
This means that the first method should be faster if there are lots of banned records, the, and the second method is faster for few banned records and few distinct IP's.
Let's create two sample tables and see:
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;
CREATE TABLE t_ban (
id INT NOT NULL PRIMARY KEY,
ip INT NOT NULL,
banned TINYINT NOT NULL,
username VARCHAR(20) NOT NULL,
KEY ix_ban_ip_banned (ip, banned),
KEY ix_ban_banned (banned)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE t_ban2 (
id INT NOT NULL PRIMARY KEY,
ip INT NOT NULL,
banned TINYINT NOT NULL,
username VARCHAR(20) NOT NULL,
KEY ix_ban2_ip_banned (ip, banned),
KEY ix_ban2_banned (banned)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;
INSERT
INTO t_ban (id, ip, banned, username)
SELECT id,
INET_ATON('10.0.0.0') + FLOOR(RAND(20090618) * 2048),
RAND(20090618 << 1) < 0.99,
CONCAT('User ', id)
FROM filler;
INSERT
INTO t_ban2 (id, ip, banned, username)
SELECT id,
INET_ATON('10.0.0.0') + FLOOR(RAND(20090618 << 2) * 256),
RAND(20090618 << 3) < 0.0003,
CONCAT('User ', id)
FROM filler;
[/sourcecode]
This creates two tables: <code>t_ban</code> and <code>t_ban2</code>.
<li><code>t_ban</code>'s <code>IP</code>'s are randomly disributed over the range of <code>2048</code> values and there are <code>99%</code> banned records in <code>t_ban</code>
</li>
<code>t_ban2</code>'s <code>IP</code>'s are randomly disributed over the range of <code>246</code> values and there are <code>0,3%</code> banned records in <code>t_ban</code>.
In other words, almost everybody is <strong>banned</strong> in t_ban, and almost everybody is not banned in <code>t_ban2</code>.
Now, let's run the queries against the first table:
SELECT COUNT(*)
FROM (
SELECT ip
FROM (
SELECT DISTINCT ip
FROM t_ban
) bd
WHERE EXISTS
(
SELECT 1
FROM t_ban bb
WHERE bb.ip = bd.ip
AND banned
LIMIT 4, 1
)
) bi
JOIN t_ban bo
ON bo.ip = bi.ip
AND NOT bo.banned
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9961 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.7969s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived2> | ALL | 2048 | 100.00 | |||||
| 1 | PRIMARY | bo | ref | ix_ban_ip_banned | ix_ban_ip_banned | 4 | bi.ip | 214 | 100.00 | Using where; Using index |
| 2 | DERIVED | <derived3> | ALL | 2048 | 100.00 | Using where | ||||
| 4 | DEPENDENT SUBQUERY | bb | ref | ix_ban_ip_banned | ix_ban_ip_banned | 4 | bd.ip | 214 | 100.00 | Using where; Using index |
| 3 | DERIVED | t_ban | range | ix_ban_ip_banned | 4 | 4661 | 100.00 | Using index for group-by |
This query uses the first method:
- It builds a set of
DISTINCTIP's - For each IP it checks if there is 5th banned record with this
IP - It joins the resultset of the found
IP's back with the table
This query runs for 0,79 s.
Same using a GROUP BY:
SELECT COUNT(*)
FROM (
SELECT ip
FROM t_ban bd
WHERE banned
GROUP BY
ip
HAVING COUNT(*) >= 5
) bi
JOIN t_ban bo
ON bo.ip = bi.ip
AND NOT bo.banned
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9961 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (1.4063s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived2> | ALL | 2048 | 100.00 | |||||
| 1 | PRIMARY | bo | ref | ix_ban_ip_banned | ix_ban_ip_banned | 4 | bi.ip | 230 | 100.00 | Using where; Using index |
| 2 | DERIVED | bd | index | ix_ban_ip_banned | 5 | 997469 | 100.00 | Using where; Using index |
This does the same, but counts all values.
This query is two times as slow as the EXISTS one: there are lots of values to COUNT.
Now, let's check the same for the second table (which has few banned users).
First, the EXISTS query:
SELECT COUNT(*)
FROM (
SELECT ip
FROM (
SELECT DISTINCT ip
FROM t_ban2
) bd
WHERE EXISTS
(
SELECT 1
FROM t_ban2 bb
WHERE bb.ip = bd.ip
AND banned
LIMIT 4, 1
)
) bi
JOIN t_ban2 bo
ON bo.ip = bi.ip
AND NOT bo.banned
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4011 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.8875s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived2> | system | 1 | 100.00 | |||||
| 1 | PRIMARY | bo | ref | ix_ban2_ip_banned | ix_ban2_ip_banned | 4 | const | 2894 | 100.00 | Using where; Using index |
| 2 | DERIVED | <derived3> | ALL | 256 | 100.00 | Using where | ||||
| 4 | DEPENDENT SUBQUERY | bb | ref | ix_ban2_ip_banned | ix_ban2_ip_banned | 4 | bd.ip | 2733 | 100.00 | Using where; Using index |
| 3 | DERIVED | t_ban2 | range | ix_ban2_ip_banned | 4 | 367 | 100.00 | Using index for group-by |
And now the aggregate:
SELECT COUNT(*)
FROM (
SELECT ip
FROM t_ban2 bd
WHERE banned
GROUP BY
ip
HAVING COUNT(*) >= 5
) bi
JOIN t_ban2 bo
ON bo.ip = bi.ip
AND NOT bo.banned
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4011 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.6406s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived2> | system | 1 | 100.00 | |||||
| 1 | PRIMARY | bo | ref | ix_ban2_ip_banned | ix_ban2_ip_banned | 4 | const | 2894 | 100.00 | Using where; Using index |
| 2 | DERIVED | bd | index | ix_ban2_ip_banned | 5 | 1000642 | 100.00 | Using where; Using index |
As you can see, if there are few banned IP's, the COUNT becomes more efficent.
This is because aggregating a small resultset is faster than performing a self JOIN for each DISTINCT IP.
In t_ban2, there are too few IP's that have COUNT(*) < 5, so counting as such is not a problem anymore, and an extra JOIN is able to kill the performance benefits.
Summary:
There are two methods to filter out the IP's that have less than 5 banned records:
- Finding 5th record for each
DISTINCT IPusingEXISTS - Aggregation and filtering on
COUNT()'s
If there are many banned records, the first method is faster, since it doesn't need to count.
If there are few banned records, the second method is faster, since it doesn't need to JOIN.
Subscribe in a reader