Selecting IP's that have more than 5 banned records at EXPLAIN EXTENDED

Counting bans

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a MyISAM table in MySQL 5.0 like this:

id login ip banned

I would like to find all users not banned (banned = 0) if at least 5 other users with the same ip have already been banned (banned = 1).

To do this, we should first find which IP's have more than 5 bans and then find all non-banned users for these IP's.

To find the IP's we can use two methods:

Find all banned IP's, group them and filter out those that have COUNT(*) < 5
For each IP possible, find if there are 5 banned records for this IP using a LIMIT clause

The first method doesn't require doing an extra JOIN on the table, but it will do count all values just to figure out if the COUNT(*) is more or less than 5.

The second method stops counting whenever it find the 5th value, but it requires an extra JOIN.

Counting is bad if there are many values to count, and the extra join is bad if there are many JOIN loops.

This means that the first method should be faster if there are lots of banned records, the, and the second method is faster for few banned records and few distinct IP's.

Let's create two sample tables and see:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_ban (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        banned TINYINT NOT NULL,
        username VARCHAR(20) NOT NULL,
        KEY ix_ban_ip_banned (ip, banned),
        KEY ix_ban_banned (banned)
) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

CREATE TABLE t_ban2 (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        banned TINYINT NOT NULL,
        username VARCHAR(20) NOT NULL,
        KEY ix_ban2_ip_banned (ip, banned),
        KEY ix_ban2_banned (banned)
) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_ban (id, ip, banned, username)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090618) * 2048),
        RAND(20090618 << 1) < 0.99,
        CONCAT('User ', id)
FROM    filler;

INSERT
INTO    t_ban2 (id, ip, banned, username)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090618 << 2) * 256),
        RAND(20090618 << 3) < 0.0003,
        CONCAT('User ', id)
FROM    filler;
&#91;/sourcecode&#93;

This creates two tables: <code>t_ban</code> and <code>t_ban2</code>.

	<li><code>t_ban</code>'s <code>IP</code>'s are randomly disributed over the range of <code>2048</code> values and there are <code>99%</code> banned records in <code>t_ban</code>
</li>
<code>t_ban2</code>'s <code>IP</code>'s are randomly disributed over the range of <code>246</code> values and there are <code>0,3%</code> banned records in <code>t_ban</code>.

In other words, almost everybody is <strong>banned</strong> in t_ban, and almost everybody is not banned in <code>t_ban2</code>.

Now, let's run the queries against the first table:


SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    (
                SELECT  DISTINCT ip
                FROM    t_ban
                ) bd
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_ban bb
                WHERE   bb.ip = bd.ip
                        AND banned
                LIMIT 4, 1
                )
        ) bi
JOIN    t_ban bo
ON      bo.ip = bi.ip
        AND NOT bo.banned

COUNT(*)
9961
1 row fetched in 0.0001s (0.7969s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					2048	100.00
1	PRIMARY	bo	ref	ix_ban_ip_banned	ix_ban_ip_banned	4	bi.ip	214	100.00	Using where; Using index
2	DERIVED	<derived3>	ALL					2048	100.00	Using where
4	DEPENDENT SUBQUERY	bb	ref	ix_ban_ip_banned	ix_ban_ip_banned	4	bd.ip	214	100.00	Using where; Using index
3	DERIVED	t_ban	range		ix_ban_ip_banned	4		4661	100.00	Using index for group-by

This query uses the first method:

It builds a set of DISTINCT IP's
For each IP it checks if there is 5th banned record with this IP
It joins the resultset of the found IP's back with the table

This query runs for 0,79 s.

Same using a GROUP BY:

SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    t_ban bd
        WHERE   banned
        GROUP BY
                ip
        HAVING COUNT(*) >= 5
        ) bi
JOIN    t_ban bo
ON      bo.ip = bi.ip
        AND NOT bo.banned

COUNT(*)
9961
1 row fetched in 0.0001s (1.4063s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					2048	100.00
1	PRIMARY	bo	ref	ix_ban_ip_banned	ix_ban_ip_banned	4	bi.ip	230	100.00	Using where; Using index
2	DERIVED	bd	index		ix_ban_ip_banned	5		997469	100.00	Using where; Using index

This does the same, but counts all values.

This query is two times as slow as the EXISTS one: there are lots of values to COUNT.

Now, let's check the same for the second table (which has few banned users).

First, the EXISTS query:

SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    (
                SELECT  DISTINCT ip
                FROM    t_ban2
                ) bd
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_ban2 bb
                WHERE   bb.ip = bd.ip
                        AND banned
                LIMIT 4, 1
                )
        ) bi
JOIN    t_ban2 bo
ON      bo.ip = bi.ip
        AND NOT bo.banned

COUNT(*)
4011
1 row fetched in 0.0001s (0.8875s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	system					1	100.00
1	PRIMARY	bo	ref	ix_ban2_ip_banned	ix_ban2_ip_banned	4	const	2894	100.00	Using where; Using index
2	DERIVED	<derived3>	ALL					256	100.00	Using where
4	DEPENDENT SUBQUERY	bb	ref	ix_ban2_ip_banned	ix_ban2_ip_banned	4	bd.ip	2733	100.00	Using where; Using index
3	DERIVED	t_ban2	range		ix_ban2_ip_banned	4		367	100.00	Using index for group-by

And now the aggregate:

SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    t_ban2 bd
        WHERE   banned
        GROUP BY
                ip
        HAVING COUNT(*) >= 5
        ) bi
JOIN    t_ban2 bo
ON      bo.ip = bi.ip
        AND NOT bo.banned

COUNT(*)
4011
1 row fetched in 0.0001s (0.6406s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	system					1	100.00
1	PRIMARY	bo	ref	ix_ban2_ip_banned	ix_ban2_ip_banned	4	const	2894	100.00	Using where; Using index
2	DERIVED	bd	index		ix_ban2_ip_banned	5		1000642	100.00	Using where; Using index

As you can see, if there are few banned IP's, the COUNT becomes more efficent.

This is because aggregating a small resultset is faster than performing a self JOIN for each DISTINCT IP.

In t_ban2, there are too few IP's that have COUNT(*) < 5, so counting as such is not a problem anymore, and an extra JOIN is able to kill the performance benefits.

Summary:

There are two methods to filter out the IP's that have less than 5 banned records:

Finding 5th record for each DISTINCT IP using EXISTS
Aggregation and filtering on COUNT()'s

If there are many banned records, the first method is faster, since it doesn't need to count.

If there are few banned records, the second method is faster, since it doesn't need to JOIN.

Written by Quassnoi

June 18th, 2009 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED

Counting bans

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Counting bans

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow