Counting missing rows
From Stack Overflow:
In my first programming job I was told that a query within the
IN()predicate gets executed for every row contained in the parent query, and therefore usingINshould be avoided.For example, given the query:
SELECT COUNT(*) FROM Table1 WHERE Table1Id NOT IN ( SELECT Table1Id FROM Table2 WHERE id_user = 1 )
Rows Inner subquery executions 10 10 100 100 1000 1000 10000 10000 Is this correct? How does the
INpredicate actually work?
This is the 1st of 5 articles covering implementation of NOT IN predicate in several RDBMS‘es:
NOT INin MySQLNOT INin MySQL (usingDISTINCT)NOT INin SQL ServerNOT INin OracleNOT INin PostgreSQL
In this article I will describe how MySQL implements this predicate.
Let’s create the sample tables:
Table creation details
There are 1,000,000 rows in table1 and 1,000 rows in table2.
Now, let’s see the execution plan for this query:
SELECT COUNT(*)
FROM table1 t1
WHERE t1.id NOT IN
(
SELECT id
FROM table2 t2
)
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0002s (4.0312s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | t1 | index | PRIMARY | 4 | 1000000 | 100.00 | Using where; Using index | ||
| 2 | DEPENDENT SUBQUERY | t2 | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using index |
select count(0) AS `COUNT(*)` from `20090418_in`.`table1` `t1` where (not(<in_optimizer>(`20090418_in`.`t1`.`id`,<exists>(<primary_index_lookup>(<cache>(`20090418_in`.`t1`.`id`) in table2 on PRIMARY)))))
We see that there is a DEPENDENT SUBQUERY in the plan, but we see no correlated subqueries. What is this DEPENDENT SUBQUERY and what does it depend on?
If we look into the notice returned by EXPLAIN EXTENDED, we can see that the subquery is optimized to a NOT EXISTS subquery using a correlated value:
SELECT COUNT(*)
FROM table1
WHERE NOT EXISTS
(
SELECT 1
FROM table2
WHERE table2.id = table1.id
)
For each record from table1, MySQL engine executes the following correlated subquery:
SELECT 1 FROM table2 WHERE table2.id = table1.id LIMIT 1
and returns the record into the resultset only if this correlated subquery does not return any rows.
The latter subquery really depends on the values from the outer space, that’s why EXPLAIN EXTENDED describes it as a dependent subquery.
So is it really correct that the NOT IN subquery is executed as many times as there are rows in table1?
Yes and no.
Yes, because some subquery is really executed 1,000,000 times.
No, because in fact it’s another subquery. Instead of selecting whole 1,000 of rows from table2, on each step MySQL selects at most 1 record from table2 (using the index lookup), and doesn’t even use the selected value: it just checks if it exists.
And now a bonus trick.
We select COUNT(*) of those records from table1 whose id‘s do not exist in table2. There are 1,000,000 rows in table1, and even with these optimizations, it takes a lot of time:
SELECT COUNT(*)
FROM table1 t1
WHERE t1.id NOT IN
(
SELECT id
FROM table2
)
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 rows fetched in 0.0011s (4.6406s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
More than 4 seconds.
But there are only 1,000 rows in table2.
What if we COUNT(*) the values from table1 that do exist in table2?
It can be done in a simple join with table2 as a leading table, 1,000 index lookups instead of a 1,000,000. And then we just subtract this value from the total COUNT(*) of values in table1, which is stored in MyISAM‘s metadata and can be fetched instantly.
Let’s try it:
SELECT (
SELECT COUNT(*)
FROM table1 t1
) -
(
SELECT COUNT(*)
FROM table2 t1, table1 t2
WHERE t2.id = t1.id
) AS count
| count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 rows fetched in 0.0012s (0.0077s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Much faster, I should say.
To be continued.







Subscribe in a reader