Counting missing rows: using DISTINCT
This is the 2nd of 5 articles covering implementation of NOT IN predicate in several RDBMS‘es:
NOT INin MySQLNOT INin MySQL (usingDISTINCT)NOT INin SQL ServerNOT INin OracleNOT INin PostgreSQL
Today, I will tell about some useful tricks that help to optimize these subqueries even more.
Let’s create some sample tables:
Table creation details
These tables are much like those we used yesterday: 1,000,000 rows in table1, 1,000 rows in table2.
But table1 has now one more field: grouper. This field holds values from 0 to 999, only 0 not being among id‘s from table2.
There are 1,000 records with grouper = 0 in table1.
Now, let’s try to COUNT(*) those rows from table1 whose grouper is never found among id‘s of table2.
As it’s only 0 that is never found in table2, there should be 1,000 such records.
Just like yesterday, we can use a NOT IN predicate:
SELECT COUNT(*)
FROM table1 t1
WHERE t1.grouper NOT IN
(
SELECT id
FROM table2
)
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (2.2812s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | t1 | index | ix_table1_grouper | 4 | 997915 | 100.00 | Using where; Using index | ||
| 2 | DEPENDENT SUBQUERY | table2 | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using index |
select count(0) AS `COUNT(*)` from `20090419_in`.`table1` `t1` where (not(<in_optimizer>(`20090419_in`.`t1`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`20090419_in`.`t1`.`grouper`) in table2 on PRIMARY)))))
, which uses the same execution plan and even works for almost exactly same time.
As we have already figured out, the subquery is being optimized to a correlated subquery and executed 1,000,000 times.
When we searched for missing id‘s (that are UNIQUE within table1), the correlated subquery must have been run 1,000,000 times and not a single time less. Each of 1,000,000 values could either exist or not in table2, and 1,000,000 checks were indeed required.
But now we have only 1,000 DISTINCT grouper‘s in table1. Do we really still need to perform one million checks?
Unfortunately, MySQL cannot cache results of a dependent subquery. But with a little effort from our side, we can help it.
We can build our query in the following way:
- Select all
DISTINCT grouper‘s fromtable1 - Find those
grouper‘s that do not exist asid‘s intable2 - Join
table1on thosegrouper‘s
Here it is:
SELECT COUNT(*)
FROM (
SELECT DISTINCT grouper
FROM table1
) q
JOIN table1 t1
ON t1.grouper = q.grouper
WHERE q.grouper NOT IN
(
SELECT id
FROM table2
)
| COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0001s (0.0185s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived2> | ALL | 1000 | 100.00 | Using where | ||||
| 1 | PRIMARY | t1 | ref | ix_table1_grouper | ix_table1_grouper | 4 | q.grouper | 446 | 100.00 | Using index |
| 3 | DEPENDENT SUBQUERY | table2 | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using index |
| 2 | DERIVED | table1 | range | ix_table1_grouper | 4 | 2238 | 100.00 | Using index for group-by |
select count(0) AS `COUNT(*)` from (select distinct `20090419_in`.`table1`.`grouper` AS `grouper` from `20090419_in`.`table1`) `q` join `20090419_in`.`table1` `t1` where ((`20090419_in`.`t1`.`grouper` = `q`.`grouper`) and (not(<in_optimizer>(`q`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`q`.`grouper`) in table2 on PRIMARY))))))
We see that INDEX FOR GROUP-BY on table1 is used to select DISTINCT grouper‘s from table1. This is quite an efficient method for selecting distinct values.
Then the DEPENDENT SUBQUERY is run for each DISTINCT grouper, there being as few as 1,000 of them.
Finally, INDEX RANGE SCAN on table1 is performed for every missing grouper, and the rows are counted.
The query completes in only 18 ms, which is faster by orders of magnitude.
Note that I recreated the tables in InnoDB, and this solutions does not use the arcane tricks like COUNT(*) in metadata: just indexes, DISTINCT‘s and other honorable features of a well respected RDBMS.
To be continued.







Subscribe in a reader