IN list vs. range condition: MySQL at EXPLAIN EXTENDED

IN list vs. range condition: MySQL

Answering questions asked on the site.

Princess asks:

Hello, I've got a problem with SQL.

I have a table which contains the production details for the factories. They are not factories and items of course but I cannot disclose the project and need to obfuscate so let's pretend they are :)

I need to select the items for which the first 5 factories have low production rate.

I tried to do the query like this:
SELECT  ProductionItem
FROM    FactoryProductions
WHERE   5 &gt;= FactoryID
AND 100 &gt;= ProductionAmount
which returns correct results but is slow.

I have an index on (FactoryID, ProductionAmount).

There are 13 Factories and 2,300,000 Items

This is in MySQL

This is a nice illustration of how index range evaluation works in MySQL and how to optimize it.

We will assume that the ProductionAmount is usually much higher than 100

Now, let's create a sample table:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_factory (
factory INT NOT NULL,
item INT NOT NULL,
production FLOAT NOT NULL,
PRIMARY KEY (factory, item),
KEY (factory, production)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_factory (factory, item, production)
SELECT  (id - 1) div 100000 + 1, (id - 1) % 100000 + 1, RAND(20091007) * 10000000
FROM    filler;

This table contains 10 factories, 100,000 items and random production values from 0 to 10,000,000. This is less than your table but enough to show the principle.

Let's issue your query:

SELECT  *
FROM    t_factory
WHERE   factory &lt;= 5
        AND production &lt;= 100

View query details

factory	item	production
4	24768	25.5276
5	20126	8.67993
5	47547	11.3249
3 rows fetched in 0.0002s (0.4110s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_factory	range	PRIMARY,factory	PRIMARY	4		500326	100.00	Using where

select `20091007_in`.`t_factory`.`factory` AS `factory`,`20091007_in`.`t_factory`.`item` AS `item`,`20091007_in`.`t_factory`.`production` AS `production` from `20091007_in`.`t_factory` where ((`20091007_in`.`t_factory`.`factory` <= 5) and (`20091007_in`.`t_factory`.`production` <= 100))

We see that the query uses a range on the index over (factory, production) but it's still slow, 400 ms.

Why is it?

If we look into the execution plan, we will see that the key_len is 4. However, we have two columns in the index, so the total key length should be 8 (sizeof(int) + sizeof(int)).

This means that only a part of a key is used. MySQL uses the range to get all records with factory <= 5 then browses them and filters out those with production > 100.

This requires browsing lots of records.

MySQL does it because it cannot get a single, countiguous range from this condition. There may be potentially infinite number of records with factory <= 5, each on them having its own range of production <= 100, and these ranges are not consecutive.

Though MySQL cannot do a range access method on an infinite number of ranges, it can do it allright on a finite number of ranges.

That is, if we explicitly provide the id's of the factories, each id will make a single contiguous range: (id, -Inf) <= (factory, production) <= (id, 100)

And since we have a limited number of the id's, MySQL can combine all these ranges and get them in a single trip over the index.

Here's the query to do that:

SELECT  *
FROM    t_factory
WHERE   factory IN (1, 2, 3, 4, 5)
AND production &lt;= 100

View query details

factory	item	production
4	24768	25.5276
5	20126	8.67993
5	47547	11.3249
3 rows fetched in 0.0002s (0.0022s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_factory	range	PRIMARY,factory	factory	8		6	100.00	Using where; Using index

select `20091007_in`.`t_factory`.`factory` AS `factory`,`20091007_in`.`t_factory`.`item` AS `item`,`20091007_in`.`t_factory`.`production` AS `production` from `20091007_in`.`t_factory` where ((`20091007_in`.`t_factory`.`factory` in (1,2,3,4,5)) and (`20091007_in`.`t_factory`.`production` <= 100))

This query is instant, only 2 ms.

But what if this list is too long to generate?

We can build in runtime with the same condition. The only thing we need is to do it over the DISTINCT values factory. Since we have an index on (factory, production), DISTINCT is very fast because it will be optimized with an index for group by:

SELECT  fi.*
FROM    (
SELECT  DISTINCT factory
FROM    t_factory
WHERE   factory &lt;= 5
        ) fd
JOIN    t_factory fi
ON      fi.factory &lt;= fd.factory
        AND fi.factory &gt;= fd.factory
AND fi.production &lt;= 100

View query details

factory	item	production
4	24768	25.5276
5	20126	8.67993
5	47547	11.3249
3 rows fetched in 0.0002s (0.0026s)

id	select_type	table	type	possible_keys	key	key_len	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL				5	100.00
1	PRIMARY	fi	ALL	PRIMARY,factory			1000653	100.00	Range checked for each record (index map: 0x3)
2	DERIVED	t_factory	range	PRIMARY,factory	PRIMARY	4	9	100.00	Using where; Using index for group-by

select `20091007_in`.`fi`.`factory` AS `factory`,`20091007_in`.`fi`.`item` AS `item`,`20091007_in`.`fi`.`production` AS `production` from (select distinct `20091007_in`.`t_factory`.`factory` AS `factory` from `20091007_in`.`t_factory` where (`20091007_in`.`t_factory`.`factory` <= 5)) `fd` join `20091007_in`.`t_factory` `fi` where ((`20091007_in`.`fi`.`factory` <= `fd`.`factory`) and (`20091007_in`.`fi`.`factory` >= `fd`.`factory`) and (`20091007_in`.`fi`.`production` <= 100))

The query is split into three parts:

The range is applied to the table: only that part of the index is used which has factory <= 5.
DISTINCT clause is applied. Using index for group-by in the query plan means that the engine skips the duplicate values using the index: whenever it finds a key, it returns it and then searches for the next key. This is very fast.
Finally, t_factory is joined back. Note usage of the following condition in the WHERE clause: fi.factory <= fd.factory AND fi.factory >= fd.factory. Those who read MySQL articles in my blog are familiar with it: it forces the Range checked for each record. For each distinct factory returned from the query above, the engine evaluates the condition, finds it to be the equivalent to (fd.factory, -Inf) <= (fi.factory, fi.production) <= (fi.factory, +Inf) and uses the range access to find the values for this condition.

This query is only 2 ms, also instant.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

October 7th, 2009 at 11:00 pm

Posted in MySQL

2 Responses to 'IN list vs. range condition: MySQL'

Subscribe to comments with RSS

“Those who read MySQL articles in my blog are familiar with it: it forces the Range checked for each record. For each distinct(明显的) factory returned from the query above, the engine evaluates(评价) the condition, finds it to be the equivalent(等价物) to (fd.factory, -Inf) <= (fi.factory, fi.production) <= (fi.factory, +Inf) and uses the range access to find the values for this condition."

can u tell me "which article tell us that?"

phil

5 Jan 13 at 20:02
i find by google

phil

5 Jan 13 at 20:05

EXPLAIN EXTENDED