EXPLAIN EXTENDED

How to create fast database queries

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

with 16 comments

Which method is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

Finally, it's MySQL time.

As always, we will create the sample tables:


Table creation details

Table t_left contains 100,000 rows with 10,000 distinct values.

Table t_right contains 1,000,000 rows with 10,000 distinct values.

There are 10 rows in t_left with values not present in t_right.

In both tables the field value is indexed.

LEFT JOIN / IS NULL

SELECT  l.id, l.value
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

View query results and execution plan

The query returns correct results in 0.73 seconds.

If we look into the Extra part of the execution plan, we will see an interesting thing there:

Using where; Using index; Not exists

What does it mean?

MySQL documentation mentions this:

Not exists

MySQL was able to do a LEFT JOIN optimization on the query and does not examine more rows in this table for the previous row combination after it finds one row that matches the LEFT JOIN criteria. Here is an example of the type of query that can be optimized this way:

SELECT  *
FROM    t1
LEFT JOIN
        t2
ON      t1.id = t2.id
WHERE   t2.id IS NULL;

Assume that t2.id is defined as NOT NULL. In this case, MySQL scans t1 and looks up the rows in t2 using the values of t1.id. If MySQL finds a matching row in t2, it knows that t2.id can never be NULL, and does not scan through the rest of the rows in t2 that have the same id value. In other words, for each row in t1, MySQL needs to do only a single lookup in t2, regardless of how many rows actually match in t2.

This is exactly our case.

MySQL, as well as all other systems except SQL Server, is able to optimize LEFT JOIN / IS NULL to return FALSE as soon the matching value is found, and it is the only system that cared to document this behavior.

The assumption that t2.id should be defined as NOT NULL, however, is too strong, since a successfull JOIN on equality condition implies that the value found is NOT NULL.

Since MySQL is not capable of using HASH and MERGE join algorithms, the only ANTI JOIN it is capable of is the NESTED LOOPS ANTI JOIN, which is exactly what we see in the query plan (despite the fact that MySQL doesn't call it that). However, this behavior is what an ANTI JOIN does: it checks the values from the left table against only one of each distinct values in the right table, skipping the duplicates.

NOT IN

SELECT  l.id, l.value
FROM    t_left l
WHERE   value NOT IN
        (
        SELECT  value
        FROM    t_right
        )

View query results and execution plan

This query is as fast as the LEFT JOIN / NOT NULL, however its plan looks quite different.

First, it mentions a DEPENDENT SUBQUERY instead of a second table (which was used in the LEFT JOIN / IS NULL). Nominally, is a dependent subquery indeed, since we don't have a join here but rather a SELECT from a single table with predicate in the WHERE clause.

Second, the description part of the plan mentions this:

<exists>(<index_lookup>(<cache>(`20090918_anti`.`l`.`value`) in t_right on ix_right_value))

What is that?

This is of course our good old friend, the ANTI JOIN. MySQL applies EXISTS optimization to the subquery: it uses the index scan over ix_right_value and returns as soon as it finds (or not finds) a row.

NOT IN is different in how it handles NULL values. However, since both t_left.value and t_right.value are defined as NOT NULL, no NULL value can ever be returned by this predicate, and MySQL takes this into account.

Essentially, this is exactly the same plan that LEFT JOIN / IS NULL uses, despite the fact these plans are executed by the different branches of code and they look different in the results of EXPLAIN. The algorithms are in fact the same in fact and the queries complete in same time.

NOT EXISTS

SELECT  l.id, l.value
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  value
        FROM    t_right r
        WHERE   r.value = l.value
        )

View query results and execution plan

This query of course produces the same results.

Execution plan, again, is different. MySQL is the only system that produces three different plans for three different methods.

The plan does not differ much: MySQL does know what an index lookup is and what EXISTS is and it does combine them together.

EXISTS in MySQL is optimized so that it returns as soon as the first value is found. So this query in fact is an ANTI JOIN as well as first two queries are.

This query, however, is a little bit less efficient than the previous two: it takes 0.92 s.

This is not much of a performance drop, however, the query takes 27% more time.

It's hard to tell exact reason for this, since this drop is linear and does not seem to depend on data distribution, number of values in both tables etc., as long as both fields are indexed. Since there are three pieces of code in MySQL that essentialy do one job, it is possible that the code responsible for EXISTS makes some kind of an extra check which takes extra time.

Summary

MySQL can optimize all three methods to do a sort of NESTED LOOPS ANTI JOIN.

It will take each value from t_left and look it up in the index on t_right.value. In case of an index hit or an index miss, the corresponding predicate will immediately return FALSE or TRUE, respectively, and the decision to return the row from t_left or not will be made immediately without examining other rows in t_right.

However, these three methods generate three different plans which are executed by three different pieces of code. The code that executes EXISTS predicate is about 30% less efficient than those that execute index_subquery and LEFT JOIN optimized to use Not exists method.

That's why the best way to search for missing values in MySQL is using a LEFT JOIN / IS NULL or NOT IN rather than NOT EXISTS.

Written by Quassnoi

September 18th, 2009 at 11:00 pm

Posted in MySQL

16 Responses to 'NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL'

Subscribe to comments with RSS

  1. Great article, which I’ve often cited in answers on SO. However, I’m struggling to understand how your conclusion reconciles with [Optimizing Subqueries with `EXISTS` Strategy][1] which (to my reading) suggests that `NOT EXISTS` should be *more efficient* than `NOT IN`? Grateful for your thoughts.

    eggyal

    24 Dec 12 at 01:04

  2. Great article, which I’ve often cited in answers on SO. However, I’m struggling to understand how your conclusion reconciles with [Optimizing Subqueries with `EXISTS` Strategy][1] which (to my reading) suggests that `NOT EXISTS` should be *more efficient* than `NOT IN`? Grateful for your thoughts.

    [1]: http://dev.mysql.com/doc/en/subquery-optimization-with-exists.html

    eggyal

    24 Dec 12 at 01:05

  3. @eggyal: first, NOT EXISTS and EXISTS (anti-join and semi-join) are very different things using different plans. The former cannot be efficiently rewritten using the outer table leading, not with nested loops, the second can.

    Second, the article you’re linking concerns the difference in handling NULL values (EXISTS is bivalent, IN is trivalent). This is a very interesting subject however I did not cover it in this article.

    Quassnoi

    24 Dec 12 at 15:50

  4. Hi,

    Its a Great analysis, whichever i saw in forums. Its really practical examples are used and every one will understand easily.

    I were never used not existing, But i got a damn information from these article. really great and really appreciated…

    Thanks alot…

    Ashraf K C

    30 Jan 13 at 13:36

  5. what’s wrong with this?
    SELECT
    c.id_category,
    c.category
    FROM category as c
    LEFT JOIN item_category_assign as i ON i.id_category = c.id_category
    WHERE i.id_item IS NULL AND t.type = ‘digital’

    this return 0 rows, can you explain that?

    dhamaso

    19 Apr 13 at 01:19

  6. @dhamaso: what’s t.type? You don’t have a table called or aliased t in your query.

    Quassnoi

    19 Apr 13 at 01:21

  7. Sorry is i.type

    SELECT
    c.id_category,
    c.category
    FROM category as c
    LEFT JOIN item_category_assign as i ON i.id_category = c.id_category
    WHERE i.id_item IS NULL AND i.type = ‘digital’

    dhamaso

    31 May 13 at 22:12

  8. @dhamaso: move i.type into the ON clause:


    SELECT c.id_category, c.category
    FROM category c
    LEFT JOIN
    item_category_assign i
    ON i.id_category = c.id_category
    AND i.type = 'digital'
    WHERE i.id_item IS NULL

    When id_item is a null due the left join, all other fields are NULL as well.

    Quassnoi

    31 May 13 at 22:16

  9. look at the two sqls:

    select u.* from cms_points_member m,cms_users u left join cms_users_send_sms_queue q on u.users_id=q.users_id where u.relation_id=m.relation_id and q.users_id IS NULL and q.message_id=1

    Time:0.0008
    This sql has not result!

    SELECT u. *
    FROM cms_users u, cms_points_member m
    WHERE u.relation_id = m.relation_id
    AND NOT
    EXISTS (
    SELECT 1
    FROM cms_users_send_sms_queue q
    WHERE q.users_id = u.users_id
    AND q.message_id =1
    )
    Time:0.0012
    This sql has result !

    If the two sqls is the same meaning,who tell me why?

    tonnycao

    8 Jun 13 at 03:11

  10. OK,I know the error .

    SELECT u. *
    FROM cms_points_member m, cms_users u
    LEFT JOIN cms_users_send_sms_queue q ON u.users_id = q.users_id
    AND q.message_id =1
    WHERE u.relation_id = m.relation_id
    AND q.users_id IS NULL

    This is correct!

    Thinks blogger.

    tonnycao

    8 Jun 13 at 03:16

  11. “value NOT IN (select value etc.) ”
    is slow but it works great.
    thank you !

    michele

    2 Sep 13 at 19:55

  12. Thanks for your article~~!But I still wanna ask a question.Assume that t_left left join t_right by two conditions con1 and con2.There is an unique key created on con1 and con2 like unique key (con1,con2).So,will the execution plan still get a join type “ef_ref” for the following sql like :
    SELECT *
    FROM t1
    LEFT JOIN
    t2
    ON t1.con1=t2.con1 and t1.con2=t2.con2
    WHERE t2.con2 is NULL

    Chris

    16 Nov 13 at 10:32

  13. @Chris: yes, it will

    Quassnoi

    16 Nov 13 at 21:54

  14. I did the test I mentioned above.The execution plan did get a join type “ef_ref”.But you shoule make con1 and con2 the same data type in both t_left and t_right.For example, the same character set and collation when con1 and con2 are varchar

    Chris

    17 Nov 13 at 06:20

  15. Excuse me, a topic out, but Who is better between “in” and “not in” if the values in “in” are a lot than values in “not in”?
    thanks

    example

    1 case:field IN (3,6,7,8,44, …, 1052)

    2 case field NOT IN (1,2)

    oscar

    22 Nov 13 at 12:28

  16. @oscar: what do you mean by “who is better”? Please clarify your question.

    Quassnoi

    22 Nov 13 at 12:30

Leave a Reply