EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Things SQL needs: determining range cardinality. You're welcome to read and comment on it.

Counting missing rows: using DISTINCT

This is the 2nd of 5 articles covering implementation of NOT IN predicate in several RDBMS‘es:

Today, I will tell about some useful tricks that help to optimize these subqueries even more.

Let’s create some sample tables:

Table creation details

These tables are much like those we used yesterday: 1,000,000 rows in table1, 1,000 rows in table2.

But table1 has now one more field: grouper. This field holds values from 0 to 999, only 0 not being among id‘s from table2.

There are 1,000 records with grouper = 0 in table1.

Now, let’s try to COUNT(*) those rows from table1 whose grouper is never found among id‘s of table2.

As it’s only 0 that is never found in table2, there should be 1,000 such records.

Just like yesterday, we can use a NOT IN predicate:

SELECT  COUNT(*)
FROM    table1 t1
WHERE   t1.grouper NOT IN
        (
        SELECT  id
        FROM    table2
        )
COUNT(*)
1000
1 row fetched in 0.0001s (2.2812s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 index ix_table1_grouper 4 997915 100.00 Using where; Using index
2 DEPENDENT SUBQUERY table2 unique_subquery PRIMARY PRIMARY 4 func 1 100.00 Using index
select count(0) AS `COUNT(*)` from `20090419_in`.`table1` `t1` where (not(<in_optimizer>(`20090419_in`.`t1`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`20090419_in`.`t1`.`grouper`) in table2 on PRIMARY)))))

, which uses the same execution plan and even works for almost exactly same time.

As we have already figured out, the subquery is being optimized to a correlated subquery and executed 1,000,000 times.

When we searched for missing id‘s (that are UNIQUE within table1), the correlated subquery must have been run 1,000,000 times and not a single time less. Each of 1,000,000 values could either exist or not in table2, and 1,000,000 checks were indeed required.

But now we have only 1,000 DISTINCT grouper‘s in table1. Do we really still need to perform one million checks?

Unfortunately, MySQL cannot cache results of a dependent subquery. But with a little effort from our side, we can help it.

We can build our query in the following way:

  1. Select all DISTINCT grouper‘s from table1
  2. Find those grouper‘s that do not exist as id‘s in table2
  3. Join table1 on those grouper‘s

Here it is:

SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT grouper
        FROM    table1
        ) q
JOIN    table1 t1
ON      t1.grouper = q.grouper
WHERE   q.grouper NOT IN
        (
        SELECT  id
        FROM    table2
        )
COUNT(*)
1000
1 row fetched in 0.0001s (0.0185s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 1000 100.00 Using where
1 PRIMARY t1 ref ix_table1_grouper ix_table1_grouper 4 q.grouper 446 100.00 Using index
3 DEPENDENT SUBQUERY table2 unique_subquery PRIMARY PRIMARY 4 func 1 100.00 Using index
2 DERIVED table1 range ix_table1_grouper 4 2238 100.00 Using index for group-by
select count(0) AS `COUNT(*)` from (select distinct `20090419_in`.`table1`.`grouper` AS `grouper` from `20090419_in`.`table1`) `q` join `20090419_in`.`table1` `t1` where ((`20090419_in`.`t1`.`grouper` = `q`.`grouper`) and (not(<in_optimizer>(`q`.`grouper`,<exists>(<primary_index_lookup>(<cache>(`q`.`grouper`) in table2 on PRIMARY))))))

We see that INDEX FOR GROUP-BY on table1 is used to select DISTINCT grouper‘s from table1. This is quite an efficient method for selecting distinct values.

Then the DEPENDENT SUBQUERY is run for each DISTINCT grouper, there being as few as 1,000 of them.

Finally, INDEX RANGE SCAN on table1 is performed for every missing grouper, and the rows are counted.

The query completes in only 18 ms, which is faster by orders of magnitude.

Note that I recreated the tables in InnoDB, and this solutions does not use the arcane tricks like COUNT(*) in metadata: just indexes, DISTINCT‘s and other honorable features of a well respected RDBMS.

To be continued.

Written by Quassnoi

April 19th, 2009 at 11:00 pm

Posted in MySQL

Comments are closed.