Counting missing rows: using DISTINCT
Comments enabled. I *really* need your comment
This is the 2nd of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Today, I will tell about some useful tricks that help to optimize these subqueries even more.
Let's create some sample tables:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE table1 ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, grouper INT NOT NULL, value VARCHAR(50) NOT NULL, KEY ix_table1_grouper (grouper) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE table2 ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, value VARCHAR(50) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO table1 (id, grouper, value) SELECT id, id % 1000, CONCAT('Value One ', id) FROM filler; INSERT INTO table2 (id, value) SELECT id, CONCAT('Value Two ', id) FROM filler ORDER BY id LIMIT 1000; [/sourcecode] </div> These tables are much like those we used <a href="/?p=928">yesterday</a>: <strong>1,000,000</strong> rows in <code>table1</code>, <strong>1,000</strong> rows in <code>table2</code>. But <code>table1</code> has now one more field: <code>grouper</code>. This field holds values from <strong>0</strong> to <strong>999</strong>, only <strong>0</strong> not being among <code>id</code>'s from <code>table2</code>. There are <strong>1,000</strong> records with <code>grouper = 0</code> in <code>table1</code>. Now, let's try to <code>COUNT(*)</code> those rows from <code>table1</code> whose <code>grouper</code> is never found among <code>id</code>'s of <code>table2</code>. As it's only <strong>0</strong> that is never found in <code>table2</code>, there should be <code>1,000</code> such records. Just like <a href="/?p=928">yesterday</a>, we can use a <code>NOT IN</code> predicate: SELECT COUNT(*) FROM table1 t1 WHERE t1.grouper NOT IN ( SELECT id FROM table2 )
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (2.2812s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | t1 | index | ix_table1_grouper | 4 | 997915 | 100.00 | Using where; Using index | ||
2 | DEPENDENT SUBQUERY | table2 | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using index |
select count(0) AS `COUNT(*)` from `20090419_in`.`table1` `t1` where (not(<in_optimizer>(`20090419_in`.`t1`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`20090419_in`.`t1`.`grouper`) in table2 on PRIMARY)))))
, which uses the same execution plan and even works for almost exactly same time.
As we have already figured out, the subquery is being optimized to a correlated subquery and executed 1,000,000 times.
When we searched for missing id
's (that are UNIQUE
within table1
), the correlated subquery must have been run 1,000,000 times and not a single time less. Each of 1,000,000 values could either exist or not in table2, and 1,000,000 checks were indeed required.
But now we have only 1,000 DISTINCT grouper
's in table1
. Do we really still need to perform one million checks?
Unfortunately, MySQL
cannot cache results of a dependent subquery. But with a little effort from our side, we can help it.
We can build our query in the following way:
- Select all
DISTINCT grouper
's fromtable1
- Find those
grouper
's that do not exist asid
's intable2
- Join
table1
on thosegrouper
's
Here it is:
SELECT COUNT(*) FROM ( SELECT DISTINCT grouper FROM table1 ) q JOIN table1 t1 ON t1.grouper = q.grouper WHERE q.grouper NOT IN ( SELECT id FROM table2 )
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.0185s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 1000 | 100.00 | Using where | ||||
1 | PRIMARY | t1 | ref | ix_table1_grouper | ix_table1_grouper | 4 | q.grouper | 446 | 100.00 | Using index |
3 | DEPENDENT SUBQUERY | table2 | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using index |
2 | DERIVED | table1 | range | ix_table1_grouper | 4 | 2238 | 100.00 | Using index for group-by |
select count(0) AS `COUNT(*)` from (select distinct `20090419_in`.`table1`.`grouper` AS `grouper` from `20090419_in`.`table1`) `q` join `20090419_in`.`table1` `t1` where ((`20090419_in`.`t1`.`grouper` = `q`.`grouper`) and (not(<in_optimizer>(`q`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`q`.`grouper`) in table2 on PRIMARY))))))
We see that INDEX FOR GROUP-BY
on table1
is used to select DISTINCT grouper
's from table1
. This is quite an efficient method for selecting distinct values.
Then the DEPENDENT SUBQUERY
is run for each DISTINCT grouper
, there being as few as 1,000 of them.
Finally, INDEX RANGE SCAN
on table1
is performed for every missing grouper
, and the rows are counted.
The query completes in only 18 ms, which is faster by orders of magnitude.
Note that I recreated the tables in InnoDB
, and this solutions does not use the arcane tricks like COUNT(*)
in metadata: just indexes, DISTINCT
's and other honorable features of a well respected RDBMS.
To be continued.
Written by Quassnoi
April 19th, 2009 at 11:00 pm
Posted in MySQL
Leave a Reply
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009