EXPLAIN EXTENDED

How to create fast database queries

Index search time depends on the value being searched

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Daniel asks:

I have a table which stores track titles in a VARCHAR(200) field. The field is indexed, but searching for titles beginning with a letter Z is noticeably slower than for those beginning with A, and the closer the initial letter is to Z, the slower is the query.

My understanding is that a full table scan occurs, but EXPLAIN shows that the index is used. Besides, the table is quite large but the query is still reasonably fast.

Could you please explain this behavior?

MySQL stores its indexes in B-Tree data structures.

The Wikipedia link above explains the structure quite well so I won't repeat it. I'll rather draw a picture similar to the one in the article:

This picture is quite self-explanatory. The records are sorted in a tree order, so if you are searching for a certain value, say, 11, you, starting from the first page, should find the link to follow. To do this, you need to find the pair of values less than and greater than 11. In this case, you should follow the link which is between 8 and 12. Then you search for the next pair, etc, until you find your value or reach the end and make sure that your value is not there.

Following the links is quite simple, but how does the engine search for the values within one page?

This depends on how you declared the table.

MyISAM supports two algorithms for storing the index keys in a page: packed keys and unpacked keys.

Unpacked keys are what you are seeing on the picture above: each page just stores the key values and the links to the pages down the tree. This is very simple.

Packed keys are designed to improve performance on character data. Many words and phrases, especially those that are close to each other, start with the same sequence of characters.

If you are going to store track names like:

The Man Who Sold The World
The Man Who Invented Himself
The Man Who Has Everything

, MyISAM can optimize it in terms of storage space and store them like this

(The Man Who) Sold The World
(×11) Invented Himself
(×11) Has Everything

This is called key compression: instead of repeating the key value for each record, MyISAM just stores the longest common prefix once and prepends the subsequent records that share it with its lengths. This makes the keys shorter and the index more compact.

However, this affects the index search time.

With an unpacked index, a binary search is applied to find the keys within each level page.

With a packed index, this won't work: you need to know the value of two keys to compare them and not every record contains full information about the key.

So in case of a packed index, MySQL remembers the value of the prefix and iterates the records one by one. This is less efficient than a binary search, but due to the fact that much more records can fit on on page, this keeps the amount of page traversals to a minimum and overall efficiency increases.

But the records on one page still need to be compared and searched for. And with a linear search, the keys with less values tend to require less iterations than those with greater values.

Let's look on the picture above again. The keys are searched left to right.

To search for key 1, we only need two comparisons: compare to 4, get the next page, compare to 1.

But to search for key 15, we need to compare to 4, 8, 12 then get to the next page and compare to 13, 14 and finally to 15.

This is 6 operations compared to 2 required to fetch the first key.

Now, let's create the sample tables and see some figures:

Table creation details

There are two MyISAM tables with randomly generated character sequences like this:

id name
1 RTUPPH
2 RKZQJW
3 FKMEKL
4 BYOZFE
5 GSTRAF
6 YBNMSG
7 ZEZKCE
8 PMPNUJ
9 OQMMYH
10 OYAFDZ
49999 DZMKRC
50000 NHYWLR

The structure of the tables is identical, except that t_packed packs keys and t_unpacked does not.

Unpacked keys


SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'ABCDEF', NULL)
COUNT(*)
50000
1 row fetched in 0.0001s (0.7656s)
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'GHIJKL', NULL)
COUNT(*)
50000
1 row fetched in 0.0001s (0.7331s)
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'NOPQRS', NULL)
COUNT(*)
50000
1 row fetched in 0.0001s (0.7656s)
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'ZYXWVU', NULL)
COUNT(*)
50000
1 row fetched in 0.0001s (0.7712s)

The queries against the strings beginning with A, G, N and Z take the same time. All queries have been run several times to populate the cache and the execution times are consistent.

The LEFT JOIN against a non-existent value was used in the query to avoid stopping on a found key and make the query traverse the index as much as possible. We also put a formal dependency on s.id here so that t_source is always leading in the join and no const optimizations are performed.

Packed keys


Let's try the same queries on packed keys:

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'ABCDEF', NULL)
COUNT(*)
50000
1 row fetched in 0.0001s (1.9531s)
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'GHIJKL', NULL)
COUNT(*)
50000
1 row fetched in 0.0001s (2.3593s)
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'NOPQRS', NULL)
COUNT(*)
50000
1 row fetched in 0.0002s (2.7812s)
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'ZYXWVU', NULL)
COUNT(*)
50000
1 row fetched in 0.0002s (2.9375s)

The query against the string beginning with the letter Z takes more than 50% more time than the query against a string beginning with A.

Summary


Here's a little summary table:

Search string Unpacked key Time, % Packed key Time, %
ABCDEF 0.7656 100.00 1.9531 100.00
GHIJKL 0.7331 95.75 2.3593 120.79
NOPQRS 0.7656 100.00 2.7812 142.39
ZYXWVU 0.7712 100.73 2.9375 150.40

We see that the value being searched for does not affect time to search the index with unpacked keys but seriously affects performance of the indexes with packed keys.

This increase is due to linear search used to locate the records within a single page.

Note that not any pair of records in a page (both with packed and unpacked keys) have a corresponding lower-level page containing intermediate values. There can be leaves and branches on the same depth in the tree.

This can lead to some artifacts: certain strings can be found (or proved absent) faster than the others. However, in average, all records have same depth. And in average, with packed indexes, the need for linear search increases the time required to find the keys with the greater values.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 4th, 2010 at 11:00 pm

Posted in MySQL

Leave a Reply