MySQL: improving NOT IN predicate using DISTINCT at EXPLAIN EXTENDED

Counting missing rows: using DISTINCT

Comments enabled. I *really* need your comment

This is the 2^nd of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:

NOT IN in MySQL
NOT IN in MySQL (using DISTINCT)
NOT IN in SQL Server
NOT IN in Oracle
NOT IN in PostgreSQL

Today, I will tell about some useful tricks that help to optimize these subqueries even more.

Let's create some sample tables:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE table1 (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        grouper INT NOT NULL,
        value VARCHAR(50) NOT NULL,
        KEY ix_table1_grouper (grouper)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE table2 (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        value VARCHAR(50) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    table1 (id, grouper, value)
SELECT  id, id % 1000, CONCAT('Value One ', id)
FROM    filler;

INSERT
INTO    table2 (id, value)
SELECT  id, CONCAT('Value Two ', id)
FROM    filler
ORDER BY
        id
LIMIT   1000;
&#91;/sourcecode&#93;
</div>

These tables are much like those we used <a href="/?p=928">yesterday</a>: <strong>1,000,000</strong> rows in <code>table1</code>, <strong>1,000</strong> rows in <code>table2</code>.

But <code>table1</code> has now one more field: <code>grouper</code>. This field holds values from <strong>0</strong> to <strong>999</strong>, only <strong>0</strong> not being among <code>id</code>'s from <code>table2</code>.

There are <strong>1,000</strong> records with <code>grouper = 0</code> in <code>table1</code>.

Now, let's try to <code>COUNT(*)</code> those rows from <code>table1</code> whose <code>grouper</code> is never found among <code>id</code>'s of <code>table2</code>.

As it's only <strong>0</strong> that is never found in <code>table2</code>, there should be <code>1,000</code> such records.

Just like <a href="/?p=928">yesterday</a>, we can use a <code>NOT IN</code> predicate:


SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.grouper NOT IN
        (
        SELECT  id
        FROM    table2
        )

COUNT(*)
1000
1 row fetched in 0.0001s (2.2812s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t1	index		ix_table1_grouper	4		997915	100.00	Using where; Using index
2	DEPENDENT SUBQUERY	table2	unique_subquery	PRIMARY	PRIMARY	4	func	1	100.00	Using index

select count(0) AS `COUNT(*)` from `20090419_in`.`table1` `t1` where (not(<in_optimizer>(`20090419_in`.`t1`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`20090419_in`.`t1`.`grouper`) in table2 on PRIMARY)))))

, which uses the same execution plan and even works for almost exactly same time.

As we have already figured out, the subquery is being optimized to a correlated subquery and executed 1,000,000 times.

When we searched for missing id's (that are UNIQUE within table1), the correlated subquery must have been run 1,000,000 times and not a single time less. Each of 1,000,000 values could either exist or not in table2, and 1,000,000 checks were indeed required.

But now we have only 1,000 DISTINCT grouper's in table1. Do we really still need to perform one million checks?

Unfortunately, MySQL cannot cache results of a dependent subquery. But with a little effort from our side, we can help it.

We can build our query in the following way:

Select all DISTINCT grouper's from table1
Find those grouper's that do not exist as id's in table2
Join table1 on those grouper's

Here it is:

SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT grouper
        FROM    table1
        ) q
JOIN    table1 t1
ON      t1.grouper = q.grouper
WHERE   q.grouper NOT IN
        (
        SELECT  id
        FROM    table2
        )

COUNT(*)
1000
1 row fetched in 0.0001s (0.0185s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					1000	100.00	Using where
1	PRIMARY	t1	ref	ix_table1_grouper	ix_table1_grouper	4	q.grouper	446	100.00	Using index
3	DEPENDENT SUBQUERY	table2	unique_subquery	PRIMARY	PRIMARY	4	func	1	100.00	Using index
2	DERIVED	table1	range		ix_table1_grouper	4		2238	100.00	Using index for group-by

select count(0) AS `COUNT(*)` from (select distinct `20090419_in`.`table1`.`grouper` AS `grouper` from `20090419_in`.`table1`) `q` join `20090419_in`.`table1` `t1` where ((`20090419_in`.`t1`.`grouper` = `q`.`grouper`) and (not(<in_optimizer>(`q`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`q`.`grouper`) in table2 on PRIMARY))))))

We see that INDEX FOR GROUP-BY on table1 is used to select DISTINCT grouper's from table1. This is quite an efficient method for selecting distinct values.

Then the DEPENDENT SUBQUERY is run for each DISTINCT grouper, there being as few as 1,000 of them.

Finally, INDEX RANGE SCAN on table1 is performed for every missing grouper, and the rows are counted.

The query completes in only 18 ms, which is faster by orders of magnitude.

Note that I recreated the tables in InnoDB, and this solutions does not use the arcane tricks like COUNT(*) in metadata: just indexes, DISTINCT's and other honorable features of a well respected RDBMS.

To be continued.

Written by Quassnoi

April 19th, 2009 at 11:00 pm

Posted in MySQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

EXPLAIN EXTENDED

Counting missing rows: using DISTINCT

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Counting missing rows: using DISTINCT

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow