Counting missing rows at EXPLAIN EXTENDED

In my first programming job I was told that a query within the IN() predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.

For example, given the query:
SELECT  COUNT(*)
FROM    Table1
WHERE   Table1Id NOT IN
	(
	SELECT	Table1Id
	FROM	Table2
	WHERE	id_user = 1
	)
Rows Inner subquery executions

10 10

100 100

1000 1000

10000 10000

Is this correct? How does the IN predicate actually work?

Rows	Inner subquery executions
10	10
100	100
1000	1000
10000	10000

This is the 1^st of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

NOT IN in MySQL
NOT IN in MySQL (using DISTINCT)
NOT IN in SQL Server
NOT IN in Oracle
NOT IN in PostgreSQL

In this article I will describe how MySQL implements this predicate.

Let's create the sample tables:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE table1 (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
	value VARCHAR(50) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE table2 (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
	value VARCHAR(50) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO	table1 (id, value)
SELECT	id, CONCAT('Value One ', id)
FROM	filler;

INSERT
INTO	table2 (id, value)
SELECT	id, CONCAT('Value Two ', id)
FROM	filler
ORDER BY
	id
LIMIT	1000;
&#91;/sourcecode&#93;
</div>

There are <strong>1,000,000</strong> rows in <code>table1</code> and <strong>1,000</strong> rows in <code>table2</code>.

Now, let's see the execution plan for this query:


SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.id NOT IN
        (
        SELECT  id
        FROM    table2 t2
        )

COUNT(*)
999000
1 row fetched in 0.0002s (4.0312s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t1	index		PRIMARY	4		1000000	100.00	Using where; Using index
2	DEPENDENT SUBQUERY	t2	unique_subquery	PRIMARY	PRIMARY	4	func	1	100.00	Using index

select count(0) AS `COUNT(*)` from `20090418_in`.`table1` `t1` where (not(<in_optimizer>(`20090418_in`.`t1`.`id`,<exists>(<primary_index_lookup>(<cache>(`20090418_in`.`t1`.`id`) in table2 on PRIMARY)))))

We see that there is a DEPENDENT SUBQUERY in the plan, but we see no correlated subqueries. What is this DEPENDENT SUBQUERY and what does it depend on?

If we look into the notice returned by EXPLAIN EXTENDED, we can see that the subquery is optimized to a NOT EXISTS subquery using a correlated value:

SELECT  COUNT(*)
FROM    table1
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    table2
        WHERE   table2.id = table1.id
        )

For each record from table1, MySQL engine executes the following correlated subquery:

SELECT  1
FROM    table2
WHERE   table2.id = table1.id
LIMIT 1

and returns the record into the resultset only if this correlated subquery does not return any rows.

The latter subquery really depends on the values from the outer space, that's why EXPLAIN EXTENDED describes it as a dependent subquery.

So is it really correct that the NOT IN subquery is executed as many times as there are rows in table1?

Yes and no.

Yes, because some subquery is really executed 1,000,000 times.

No, because in fact it's another subquery. Instead of selecting whole 1,000 of rows from table2, on each step MySQL selects at most 1 record from table2 (using the index lookup), and doesn't even use the selected value: it just checks if it exists.

And now a bonus trick.

We select COUNT(*) of those records from table1 whose id's do not exist in table2. There are 1,000,000 rows in table1, and even with these optimizations, it takes a lot of time:

SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.id NOT IN
        (
        SELECT  id
        FROM    table2
        )

COUNT(*)
999000
1 rows fetched in 0.0011s (4.6406s)

More than 4 seconds.

But there are only 1,000 rows in table2.

What if we COUNT(*) the values from table1 that do exist in table2?

It can be done in a simple join with table2 as a leading table, 1,000 index lookups instead of a 1,000,000. And then we just subtract this value from the total COUNT(*) of values in table1, which is stored in MyISAM's metadata and can be fetched instantly.

Let's try it:

SELECT  (
        SELECT  COUNT(*)
        FROM    table1 t1
        ) -
        (
        SELECT  COUNT(*)
        FROM    table2 t1, table1 t2
        WHERE   t2.id = t1.id
        ) AS count

count
999000
1 rows fetched in 0.0012s (0.0077s)

Much faster, I should say.

To be continued.

Written by Quassnoi

April 18th, 2009 at 11:00 pm

Posted in MySQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

EXPLAIN EXTENDED