Counting bans
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a MyISAM table in MySQL 5.0 like this:
id login ip banned I would like to find all users not banned (
banned = 0
) if at least 5 other users with the sameip
have already been banned (banned = 1
).
To do this, we should first find which IP
's have more than 5 bans and then find all non-banned users for these IP
's.
To find the IP
's we can use two methods:
- Find all banned
IP
's, group them and filter out those that haveCOUNT(*) < 5
- For each
IP
possible, find if there are 5 banned records for thisIP
using aLIMIT
clause
The first method doesn't require doing an extra JOIN
on the table, but it will do count all values just to figure out if the COUNT(*)
is more or less than 5.
The second method stops counting whenever it find the 5th value, but it requires an extra JOIN
.
Counting is bad if there are many values to count, and the extra join is bad if there are many JOIN
loops.
This means that the first method should be faster if there are lots of banned records, the, and the second method is faster for few banned records and few distinct IP
's.
Let's create two sample tables and see:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_ban ( id INT NOT NULL PRIMARY KEY, ip INT NOT NULL, banned TINYINT NOT NULL, username VARCHAR(20) NOT NULL, KEY ix_ban_ip_banned (ip, banned), KEY ix_ban_banned (banned) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE t_ban2 ( id INT NOT NULL PRIMARY KEY, ip INT NOT NULL, banned TINYINT NOT NULL, username VARCHAR(20) NOT NULL, KEY ix_ban2_ip_banned (ip, banned), KEY ix_ban2_banned (banned) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_ban (id, ip, banned, username) SELECT id, INET_ATON('10.0.0.0') + FLOOR(RAND(20090618) * 2048), RAND(20090618 << 1) < 0.99, CONCAT('User ', id) FROM filler; INSERT INTO t_ban2 (id, ip, banned, username) SELECT id, INET_ATON('10.0.0.0') + FLOOR(RAND(20090618 << 2) * 256), RAND(20090618 << 3) < 0.0003, CONCAT('User ', id) FROM filler; [/sourcecode] This creates two tables: <code>t_ban</code> and <code>t_ban2</code>. <li><code>t_ban</code>'s <code>IP</code>'s are randomly disributed over the range of <code>2048</code> values and there are <code>99%</code> banned records in <code>t_ban</code> </li> <code>t_ban2</code>'s <code>IP</code>'s are randomly disributed over the range of <code>246</code> values and there are <code>0,3%</code> banned records in <code>t_ban</code>. In other words, almost everybody is <strong>banned</strong> in t_ban, and almost everybody is not banned in <code>t_ban2</code>. Now, let's run the queries against the first table: SELECT COUNT(*) FROM ( SELECT ip FROM ( SELECT DISTINCT ip FROM t_ban ) bd WHERE EXISTS ( SELECT 1 FROM t_ban bb WHERE bb.ip = bd.ip AND banned LIMIT 4, 1 ) ) bi JOIN t_ban bo ON bo.ip = bi.ip AND NOT bo.banned
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9961 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7969s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 2048 | 100.00 | |||||
1 | PRIMARY | bo | ref | ix_ban_ip_banned | ix_ban_ip_banned | 4 | bi.ip | 214 | 100.00 | Using where; Using index |
2 | DERIVED | <derived3> | ALL | 2048 | 100.00 | Using where | ||||
4 | DEPENDENT SUBQUERY | bb | ref | ix_ban_ip_banned | ix_ban_ip_banned | 4 | bd.ip | 214 | 100.00 | Using where; Using index |
3 | DERIVED | t_ban | range | ix_ban_ip_banned | 4 | 4661 | 100.00 | Using index for group-by |
This query uses the first method:
- It builds a set of
DISTINCT
IP
's - For each IP it checks if there is 5th banned record with this
IP
- It joins the resultset of the found
IP
's back with the table
This query runs for 0,79 s.
Same using a GROUP BY
:
SELECT COUNT(*) FROM ( SELECT ip FROM t_ban bd WHERE banned GROUP BY ip HAVING COUNT(*) >= 5 ) bi JOIN t_ban bo ON bo.ip = bi.ip AND NOT bo.banned
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9961 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (1.4063s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 2048 | 100.00 | |||||
1 | PRIMARY | bo | ref | ix_ban_ip_banned | ix_ban_ip_banned | 4 | bi.ip | 230 | 100.00 | Using where; Using index |
2 | DERIVED | bd | index | ix_ban_ip_banned | 5 | 997469 | 100.00 | Using where; Using index |
This does the same, but counts all values.
This query is two times as slow as the EXISTS
one: there are lots of values to COUNT
.
Now, let's check the same for the second table (which has few banned users).
First, the EXISTS
query:
SELECT COUNT(*) FROM ( SELECT ip FROM ( SELECT DISTINCT ip FROM t_ban2 ) bd WHERE EXISTS ( SELECT 1 FROM t_ban2 bb WHERE bb.ip = bd.ip AND banned LIMIT 4, 1 ) ) bi JOIN t_ban2 bo ON bo.ip = bi.ip AND NOT bo.banned
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4011 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.8875s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | system | 1 | 100.00 | |||||
1 | PRIMARY | bo | ref | ix_ban2_ip_banned | ix_ban2_ip_banned | 4 | const | 2894 | 100.00 | Using where; Using index |
2 | DERIVED | <derived3> | ALL | 256 | 100.00 | Using where | ||||
4 | DEPENDENT SUBQUERY | bb | ref | ix_ban2_ip_banned | ix_ban2_ip_banned | 4 | bd.ip | 2733 | 100.00 | Using where; Using index |
3 | DERIVED | t_ban2 | range | ix_ban2_ip_banned | 4 | 367 | 100.00 | Using index for group-by |
And now the aggregate:
SELECT COUNT(*) FROM ( SELECT ip FROM t_ban2 bd WHERE banned GROUP BY ip HAVING COUNT(*) >= 5 ) bi JOIN t_ban2 bo ON bo.ip = bi.ip AND NOT bo.banned
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4011 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.6406s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | system | 1 | 100.00 | |||||
1 | PRIMARY | bo | ref | ix_ban2_ip_banned | ix_ban2_ip_banned | 4 | const | 2894 | 100.00 | Using where; Using index |
2 | DERIVED | bd | index | ix_ban2_ip_banned | 5 | 1000642 | 100.00 | Using where; Using index |
As you can see, if there are few banned IP
's, the COUNT
becomes more efficent.
This is because aggregating a small resultset is faster than performing a self JOIN
for each DISTINCT IP
.
In t_ban2
, there are too few IP
's that have COUNT(*) < 5
, so counting as such is not a problem anymore, and an extra JOIN
is able to kill the performance benefits.
Summary:
There are two methods to filter out the IP
's that have less than 5 banned records:
- Finding 5th record for each
DISTINCT IP
usingEXISTS
- Aggregation and filtering on
COUNT()
's
If there are many banned records, the first method is faster, since it doesn't need to count.
If there are few banned records, the second method is faster, since it doesn't need to JOIN
.