IN list vs. range condition: MySQL
Answering questions asked on the site.
Princess asks:
Hello, I've got a problem with SQL.
I have a table which contains the production details for the factories. They are not factories and items of course but I cannot disclose the project and need to obfuscate so let's pretend they are :)
I need to select the items for which the first 5 factories have low production rate.
I tried to do the query like this:
SELECT ProductionItem FROM FactoryProductions WHERE 5 >= FactoryID AND 100 >= ProductionAmountwhich returns correct results but is slow.
I have an index on
(FactoryID, ProductionAmount)
.There are 13 Factories and 2,300,000 Items
This is in MySQL
This is a nice illustration of how index range evaluation works in MySQL and how to optimize it.
We will assume that the ProductionAmount
is usually much higher than 100
Now, let's create a sample table:
Table creation details
This table contains 10 factories, 100,000 items and random production values from 0 to 10,000,000. This is less than your table but enough to show the principle.
Let's issue your query:
SELECT * FROM t_factory WHERE factory <= 5 AND production <= 100
We see that the query uses a range
on the index over (factory, production)
but it's still slow, 400 ms.
Why is it?
If we look into the execution plan, we will see that the key_len
is 4
. However, we have two columns in the index, so the total key length should be 8 (sizeof(int) + sizeof(int)
).
This means that only a part of a key is used. MySQL uses the range to get all records with factory <= 5
then browses them and filters out those with production > 100
.
This requires browsing lots of records.
MySQL does it because it cannot get a single, countiguous range from this condition. There may be potentially infinite number of records with factory <= 5
, each on them having its own range of production <= 100
, and these ranges are not consecutive.
Though MySQL cannot do a range
access method on an infinite number of ranges, it can do it allright on a finite number of ranges.
That is, if we explicitly provide the id's of the factories, each id will make a single contiguous range: (id, -Inf) <= (factory, production) <= (id, 100)
And since we have a limited number of the id
's, MySQL can combine all these ranges and get them in a single trip over the index.
Here's the query to do that:
SELECT * FROM t_factory WHERE factory IN (1, 2, 3, 4, 5) AND production <= 100
This query is instant, only 2 ms.
But what if this list is too long to generate?
We can build in runtime with the same condition. The only thing we need is to do it over the DISTINCT
values factory
. Since we have an index on (factory, production)
, DISTINCT
is very fast because it will be optimized with an index for group by
:
SELECT fi.* FROM ( SELECT DISTINCT factory FROM t_factory WHERE factory <= 5 ) fd JOIN t_factory fi ON fi.factory <= fd.factory AND fi.factory >= fd.factory AND fi.production <= 100
The query is split into three parts:
- The
range
is applied to the table: only that part of the index is used which hasfactory <= 5
. DISTINCT
clause is applied.Using index for group-by
in the query plan means that the engine skips the duplicate values using the index: whenever it finds a key, it returns it and then searches for the next key. This is very fast.- Finally,
t_factory
is joined back. Note usage of the following condition in theWHERE
clause:fi.factory <= fd.factory AND fi.factory >= fd.factory
. Those who read MySQL articles in my blog are familiar with it: it forces theRange checked for each record
. For each distinctfactory
returned from the query above, the engine evaluates the condition, finds it to be the equivalent to(fd.factory, -Inf) <= (fi.factory, fi.production) <= (fi.factory, +Inf)
and uses the range access to find the values for this condition.
This query is only 2 ms, also instant.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
“Those who read MySQL articles in my blog are familiar with it: it forces the Range checked for each record. For each distinct(明显的) factory returned from the query above, the engine evaluates(评价) the condition, finds it to be the equivalent(等价物) to (fd.factory, -Inf) <= (fi.factory, fi.production) <= (fi.factory, +Inf) and uses the range access to find the values for this condition."
can u tell me "which article tell us that?"
phil
5 Jan 13 at 20:02
i find by google
phil
5 Jan 13 at 20:05