Index search time depends on the value being searched
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Daniel asks:
I have a table which stores track titles in a
VARCHAR(200)
field. The field is indexed, but searching for titles beginning with a letter Z is noticeably slower than for those beginning with A, and the closer the initial letter is to Z, the slower is the query.My understanding is that a full table scan occurs, but
EXPLAIN
shows that the index is used. Besides, the table is quite large but the query is still reasonably fast.Could you please explain this behavior?
MySQL stores its indexes in B-Tree data structures.
The Wikipedia link above explains the structure quite well so I won't repeat it. I'll rather draw a picture similar to the one in the article:
This picture is quite self-explanatory. The records are sorted in a tree order, so if you are searching for a certain value, say, 11, you, starting from the first page, should find the link to follow. To do this, you need to find the pair of values less than and greater than 11. In this case, you should follow the link which is between 8 and 12. Then you search for the next pair, etc, until you find your value or reach the end and make sure that your value is not there.
Following the links is quite simple, but how does the engine search for the values within one page?
This depends on how you declared the table.
MyISAM supports two algorithms for storing the index keys in a page: packed keys and unpacked keys.
Unpacked keys are what you are seeing on the picture above: each page just stores the key values and the links to the pages down the tree. This is very simple.
Packed keys are designed to improve performance on character data. Many words and phrases, especially those that are close to each other, start with the same sequence of characters.
If you are going to store track names like:
The Man Who Sold The World |
The Man Who Invented Himself |
The Man Who Has Everything |
, MyISAM can optimize it in terms of storage space and store them like this
(The Man Who) Sold The World |
(×11) Invented Himself |
(×11) Has Everything |
This is called key compression: instead of repeating the key value for each record, MyISAM just stores the longest common prefix once and prepends the subsequent records that share it with its lengths. This makes the keys shorter and the index more compact.
However, this affects the index search time.
With an unpacked index, a binary search is applied to find the keys within each level page.
With a packed index, this won't work: you need to know the value of two keys to compare them and not every record contains full information about the key.
So in case of a packed index, MySQL remembers the value of the prefix and iterates the records one by one. This is less efficient than a binary search, but due to the fact that much more records can fit on on page, this keeps the amount of page traversals to a minimum and overall efficiency increases.
But the records on one page still need to be compared and searched for. And with a linear search, the keys with less values tend to require less iterations than those with greater values.
Let's look on the picture above again. The keys are searched left to right.
To search for key 1, we only need two comparisons: compare to 4, get the next page, compare to 1.
But to search for key 15, we need to compare to 4, 8, 12 then get to the next page and compare to 13, 14 and finally to 15.
This is 6 operations compared to 2 required to fetch the first key.
Now, let's create the sample tables and see some figures:
There are two MyISAM tables with randomly generated character sequences like this:
id | name | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | RTUPPH | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | RKZQJW | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | FKMEKL | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | BYOZFE | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | GSTRAF | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6 | YBNMSG | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | ZEZKCE | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 | PMPNUJ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | OQMMYH | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 | OYAFDZ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49999 | DZMKRC | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50000 | NHYWLR |
The structure of the tables is identical, except that t_packed
packs keys and t_unpacked
does not.
Unpacked keys
SELECT COUNT(*) FROM t_source s LEFT JOIN t_unpacked p ON p.name = IF(s.id > -1, _UTF8'ABCDEF', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7656s) |
SELECT COUNT(*) FROM t_source s LEFT JOIN t_unpacked p ON p.name = IF(s.id > -1, _UTF8'GHIJKL', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7331s) |
SELECT COUNT(*) FROM t_source s LEFT JOIN t_unpacked p ON p.name = IF(s.id > -1, _UTF8'NOPQRS', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7656s) |
SELECT COUNT(*) FROM t_source s LEFT JOIN t_unpacked p ON p.name = IF(s.id > -1, _UTF8'ZYXWVU', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.7712s) |
The queries against the strings beginning with A, G, N and Z take the same time. All queries have been run several times to populate the cache and the execution times are consistent.
The LEFT JOIN
against a non-existent value was used in the query to avoid stopping on a found key and make the query traverse the index as much as possible. We also put a formal dependency on s.id
here so that t_source
is always leading in the join and no const
optimizations are performed.
Packed keys
Let's try the same queries on packed keys:
SELECT COUNT(*) FROM t_source s LEFT JOIN t_packed p ON p.name = IF(s.id > -1, _UTF8'ABCDEF', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (1.9531s) |
SELECT COUNT(*) FROM t_source s LEFT JOIN t_packed p ON p.name = IF(s.id > -1, _UTF8'GHIJKL', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (2.3593s) |
SELECT COUNT(*) FROM t_source s LEFT JOIN t_packed p ON p.name = IF(s.id > -1, _UTF8'NOPQRS', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (2.7812s) |
SELECT COUNT(*) FROM t_source s LEFT JOIN t_packed p ON p.name = IF(s.id > -1, _UTF8'ZYXWVU', NULL)
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0002s (2.9375s) |
The query against the string beginning with the letter Z takes more than 50% more time than the query against a string beginning with A.
Summary
Here's a little summary table:
Search string | Unpacked key | Time, % | Packed key | Time, % |
---|---|---|---|---|
ABCDEF | 0.7656 | 100.00 | 1.9531 | 100.00 |
GHIJKL | 0.7331 | 95.75 | 2.3593 | 120.79 |
NOPQRS | 0.7656 | 100.00 | 2.7812 | 142.39 |
ZYXWVU | 0.7712 | 100.73 | 2.9375 | 150.40 |
We see that the value being searched for does not affect time to search the index with unpacked keys but seriously affects performance of the indexes with packed keys.
This increase is due to linear search used to locate the records within a single page.
Note that not any pair of records in a page (both with packed and unpacked keys) have a corresponding lower-level page containing intermediate values. There can be leaves and branches on the same depth in the tree.
This can lead to some artifacts: certain strings can be found (or proved absent) faster than the others. However, in average, all records have same depth. And in average, with packed indexes, the need for linear search increases the time required to find the keys with the greater values.
Hope that helps.
I'm always glad to answer the questions regarding database queries.