EXPLAIN EXTENDED

How to create fast database queries

Counting bans

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a MyISAM table in MySQL 5.0 like this:

id login ip banned

I would like to find all users not banned (banned = 0) if at least 5 other users with the same ip have already been banned (banned = 1).

To do this, we should first find which IP's have more than 5 bans and then find all non-banned users for these IP's.

To find the IP's we can use two methods:

  1. Find all banned IP's, group them and filter out those that have COUNT(*) < 5
  2. For each IP possible, find if there are 5 banned records for this IP using a LIMIT clause

The first method doesn't require doing an extra JOIN on the table, but it will do count all values just to figure out if the COUNT(*) is more or less than 5.

The second method stops counting whenever it find the 5th value, but it requires an extra JOIN.

Counting is bad if there are many values to count, and the extra join is bad if there are many JOIN loops.

This means that the first method should be faster if there are lots of banned records, the, and the second method is faster for few banned records and few distinct IP's.

Let's create two sample tables and see:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_ban (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        banned TINYINT NOT NULL,
        username VARCHAR(20) NOT NULL,
        KEY ix_ban_ip_banned (ip, banned),
        KEY ix_ban_banned (banned)
) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

CREATE TABLE t_ban2 (
        id INT NOT NULL PRIMARY KEY,
        ip INT NOT NULL,
        banned TINYINT NOT NULL,
        username VARCHAR(20) NOT NULL,
        KEY ix_ban2_ip_banned (ip, banned),
        KEY ix_ban2_banned (banned)
) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_ban (id, ip, banned, username)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090618) * 2048),
        RAND(20090618 << 1) < 0.99,
        CONCAT('User ', id)
FROM    filler;

INSERT
INTO    t_ban2 (id, ip, banned, username)
SELECT  id,
        INET_ATON('10.0.0.0') + FLOOR(RAND(20090618 << 2) * 256),
        RAND(20090618 << 3) < 0.0003,
        CONCAT('User ', id)
FROM    filler;
&#91;/sourcecode&#93;

This creates two tables: <code>t_ban</code> and <code>t_ban2</code>.

	<li><code>t_ban</code>'s <code>IP</code>'s are randomly disributed over the range of <code>2048</code> values and there are <code>99%</code> banned records in <code>t_ban</code>
</li>
<code>t_ban2</code>'s <code>IP</code>'s are randomly disributed over the range of <code>246</code> values and there are <code>0,3%</code> banned records in <code>t_ban</code>.

In other words, almost everybody is <strong>banned</strong> in t_ban, and almost everybody is not banned in <code>t_ban2</code>.

Now, let's run the queries against the first table:


SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    (
                SELECT  DISTINCT ip
                FROM    t_ban
                ) bd
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_ban bb
                WHERE   bb.ip = bd.ip
                        AND banned
                LIMIT 4, 1
                )
        ) bi
JOIN    t_ban bo
ON      bo.ip = bi.ip
        AND NOT bo.banned

COUNT(*)
9961
1 row fetched in 0.0001s (0.7969s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 2048 100.00
1 PRIMARY bo ref ix_ban_ip_banned ix_ban_ip_banned 4 bi.ip 214 100.00 Using where; Using index
2 DERIVED <derived3> ALL 2048 100.00 Using where
4 DEPENDENT SUBQUERY bb ref ix_ban_ip_banned ix_ban_ip_banned 4 bd.ip 214 100.00 Using where; Using index
3 DERIVED t_ban range ix_ban_ip_banned 4 4661 100.00 Using index for group-by

This query uses the first method:

  • It builds a set of DISTINCT IP's
  • For each IP it checks if there is 5th banned record with this IP
  • It joins the resultset of the found IP's back with the table

This query runs for 0,79 s.

Same using a GROUP BY:

SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    t_ban bd
        WHERE   banned
        GROUP BY
                ip
        HAVING COUNT(*) >= 5
        ) bi
JOIN    t_ban bo
ON      bo.ip = bi.ip
        AND NOT bo.banned


COUNT(*)
9961
1 row fetched in 0.0001s (1.4063s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 2048 100.00
1 PRIMARY bo ref ix_ban_ip_banned ix_ban_ip_banned 4 bi.ip 230 100.00 Using where; Using index
2 DERIVED bd index ix_ban_ip_banned 5 997469 100.00 Using where; Using index

This does the same, but counts all values.

This query is two times as slow as the EXISTS one: there are lots of values to COUNT.

Now, let's check the same for the second table (which has few banned users).

First, the EXISTS query:

SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    (
                SELECT  DISTINCT ip
                FROM    t_ban2
                ) bd
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_ban2 bb
                WHERE   bb.ip = bd.ip
                        AND banned
                LIMIT 4, 1
                )
        ) bi
JOIN    t_ban2 bo
ON      bo.ip = bi.ip
        AND NOT bo.banned

COUNT(*)
4011
1 row fetched in 0.0001s (0.8875s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> system 1 100.00
1 PRIMARY bo ref ix_ban2_ip_banned ix_ban2_ip_banned 4 const 2894 100.00 Using where; Using index
2 DERIVED <derived3> ALL 256 100.00 Using where
4 DEPENDENT SUBQUERY bb ref ix_ban2_ip_banned ix_ban2_ip_banned 4 bd.ip 2733 100.00 Using where; Using index
3 DERIVED t_ban2 range ix_ban2_ip_banned 4 367 100.00 Using index for group-by

And now the aggregate:

SELECT  COUNT(*)
FROM    (
        SELECT  ip
        FROM    t_ban2 bd
        WHERE   banned
        GROUP BY
                ip
        HAVING COUNT(*) >= 5
        ) bi
JOIN    t_ban2 bo
ON      bo.ip = bi.ip
        AND NOT bo.banned


COUNT(*)
4011
1 row fetched in 0.0001s (0.6406s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> system 1 100.00
1 PRIMARY bo ref ix_ban2_ip_banned ix_ban2_ip_banned 4 const 2894 100.00 Using where; Using index
2 DERIVED bd index ix_ban2_ip_banned 5 1000642 100.00 Using where; Using index

As you can see, if there are few banned IP's, the COUNT becomes more efficent.

This is because aggregating a small resultset is faster than performing a self JOIN for each DISTINCT IP.

In t_ban2, there are too few IP's that have COUNT(*) < 5, so counting as such is not a problem anymore, and an extra JOIN is able to kill the performance benefits.

Summary:

There are two methods to filter out the IP's that have less than 5 banned records:

  1. Finding 5th record for each DISTINCT IP using EXISTS
  2. Aggregation and filtering on COUNT()'s

If there are many banned records, the first method is faster, since it doesn't need to count.

If there are few banned records, the second method is faster, since it doesn't need to JOIN.

Written by Quassnoi

June 18th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply