Counting missing rows
Comments enabled. I *really* need your comment
From Stack Overflow:
In my first programming job I was told that a query within the
IN()
predicate gets executed for every row contained in the parent query, and therefore usingIN
should be avoided.For example, given the query:
SELECT COUNT(*) FROM Table1 WHERE Table1Id NOT IN ( SELECT Table1Id FROM Table2 WHERE id_user = 1 )
Rows Inner subquery executions 10 10 100 100 1000 1000 10000 10000 Is this correct? How does the
IN
predicate actually work?
This is the 1st of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
In this article I will describe how MySQL implements this predicate.
Let's create the sample tables:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE table1 ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, value VARCHAR(50) NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE table2 ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, value VARCHAR(50) NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO table1 (id, value) SELECT id, CONCAT('Value One ', id) FROM filler; INSERT INTO table2 (id, value) SELECT id, CONCAT('Value Two ', id) FROM filler ORDER BY id LIMIT 1000; [/sourcecode] </div> There are <strong>1,000,000</strong> rows in <code>table1</code> and <strong>1,000</strong> rows in <code>table2</code>. Now, let's see the execution plan for this query: SELECT COUNT(*) FROM table1 t1 WHERE t1.id NOT IN ( SELECT id FROM table2 t2 )
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (4.0312s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | t1 | index | PRIMARY | 4 | 1000000 | 100.00 | Using where; Using index | ||
2 | DEPENDENT SUBQUERY | t2 | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using index |
select count(0) AS `COUNT(*)` from `20090418_in`.`table1` `t1` where (not(<in_optimizer>(`20090418_in`.`t1`.`id`,<exists>(<primary_index_lookup>(<cache>(`20090418_in`.`t1`.`id`) in table2 on PRIMARY)))))
We see that there is a DEPENDENT SUBQUERY
in the plan, but we see no correlated subqueries. What is this DEPENDENT SUBQUERY
and what does it depend on?
If we look into the notice returned by EXPLAIN EXTENDED
, we can see that the subquery is optimized to a NOT EXISTS
subquery using a correlated value:
SELECT COUNT(*) FROM table1 WHERE NOT EXISTS ( SELECT 1 FROM table2 WHERE table2.id = table1.id )
For each record from table1
, MySQL engine executes the following correlated subquery:
SELECT 1 FROM table2 WHERE table2.id = table1.id LIMIT 1
and returns the record into the resultset only if this correlated subquery does not return any rows.
The latter subquery really depends on the values from the outer space, that's why EXPLAIN EXTENDED
describes it as a dependent subquery.
So is it really correct that the NOT IN
subquery is executed as many times as there are rows in table1
?
Yes and no.
Yes, because some subquery is really executed 1,000,000 times.
No, because in fact it's another subquery. Instead of selecting whole 1,000 of rows from table2
, on each step MySQL selects at most 1 record from table2
(using the index lookup), and doesn't even use the selected value: it just checks if it exists.
And now a bonus trick.
We select COUNT(*)
of those records from table1
whose id
's do not exist in table2
. There are 1,000,000
rows in table1
, and even with these optimizations, it takes a lot of time:
SELECT COUNT(*) FROM table1 t1 WHERE t1.id NOT IN ( SELECT id FROM table2 )
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 rows fetched in 0.0011s (4.6406s) |
More than 4 seconds.
But there are only 1,000 rows in table2
.
What if we COUNT(*)
the values from table1
that do exist in table2
?
It can be done in a simple join with table2
as a leading table, 1,000 index lookups instead of a 1,000,000. And then we just subtract this value from the total COUNT(*)
of values in table1
, which is stored in MyISAM's metadata and can be fetched instantly.
Let's try it:
SELECT ( SELECT COUNT(*) FROM table1 t1 ) - ( SELECT COUNT(*) FROM table2 t1, table1 t2 WHERE t2.id = t1.id ) AS count
count | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 rows fetched in 0.0012s (0.0077s) |
Much faster, I should say.
To be continued.
Written by Quassnoi
April 18th, 2009 at 11:00 pm
Posted in MySQL
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009