EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Things SQL needs: determining range cardinality. You're welcome to read and comment on it.

Counting missing rows

From Stack Overflow:

In my first programming job I was told that a query within the IN() predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.

For example, given the query:

SELECT  COUNT(*)
FROM    Table1
WHERE   Table1Id NOT IN
	(
	SELECT	Table1Id
	FROM	Table2
	WHERE	id_user = 1
	)
Rows Inner subquery executions
10 10
100 100
1000 1000
10000 10000

Is this correct? How does the IN predicate actually work?

This is the 1st of 5 articles covering implementation of NOT IN predicate in several RDBMS‘es:

In this article I will describe how MySQL implements this predicate.

Let’s create the sample tables:

Table creation details

There are 1,000,000 rows in table1 and 1,000 rows in table2.

Now, let’s see the execution plan for this query:

SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.id NOT IN
        (
        SELECT  id
        FROM    table2 t2
        )
COUNT(*)
999000
1 row fetched in 0.0002s (4.0312s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 index PRIMARY 4 1000000 100.00 Using where; Using index
2 DEPENDENT SUBQUERY t2 unique_subquery PRIMARY PRIMARY 4 func 1 100.00 Using index
select count(0) AS `COUNT(*)` from `20090418_in`.`table1` `t1` where (not(<in_optimizer>(`20090418_in`.`t1`.`id`,<exists>(<primary_index_lookup>(<cache>(`20090418_in`.`t1`.`id`) in table2 on PRIMARY)))))

We see that there is a DEPENDENT SUBQUERY in the plan, but we see no correlated subqueries. What is this DEPENDENT SUBQUERY and what does it depend on?

If we look into the notice returned by EXPLAIN EXTENDED, we can see that the subquery is optimized to a NOT EXISTS subquery using a correlated value:

SELECT  COUNT(*)
FROM    table1
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    table2
        WHERE   table2.id = table1.id
        )

For each record from table1, MySQL engine executes the following correlated subquery:

SELECT  1
FROM    table2
WHERE   table2.id = table1.id
LIMIT 1

and returns the record into the resultset only if this correlated subquery does not return any rows.

The latter subquery really depends on the values from the outer space, that’s why EXPLAIN EXTENDED describes it as a dependent subquery.

So is it really correct that the NOT IN subquery is executed as many times as there are rows in table1?

Yes and no.

Yes, because some subquery is really executed 1,000,000 times.

No, because in fact it’s another subquery. Instead of selecting whole 1,000 of rows from table2, on each step MySQL selects at most 1 record from table2 (using the index lookup), and doesn’t even use the selected value: it just checks if it exists.

And now a bonus trick.

We select COUNT(*) of those records from table1 whose id‘s do not exist in table2. There are 1,000,000 rows in table1, and even with these optimizations, it takes a lot of time:

SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.id NOT IN
        (
        SELECT  id
        FROM    table2
        )
COUNT(*)
999000
1 rows fetched in 0.0011s (4.6406s)

More than 4 seconds.

But there are only 1,000 rows in table2.

What if we COUNT(*) the values from table1 that do exist in table2?

It can be done in a simple join with table2 as a leading table, 1,000 index lookups instead of a 1,000,000. And then we just subtract this value from the total COUNT(*) of values in table1, which is stored in MyISAM‘s metadata and can be fetched instantly.

Let’s try it:

SELECT  (
        SELECT  COUNT(*)
        FROM    table1 t1
        ) -
        (
        SELECT  COUNT(*)
        FROM    table2 t1, table1 t2
        WHERE   t2.id = t1.id
        ) AS count
count
999000
1 rows fetched in 0.0012s (0.0077s)

Much faster, I should say.

To be continued.

Written by Quassnoi

April 18th, 2009 at 11:00 pm

Posted in MySQL

Comments are closed.