Index search time depends on the value being searched at EXPLAIN EXTENDED

Answering questions asked on the site.

Daniel asks:

I have a table which stores track titles in a VARCHAR(200) field. The field is indexed, but searching for titles beginning with a letter Z is noticeably slower than for those beginning with A, and the closer the initial letter is to Z, the slower is the query.

My understanding is that a full table scan occurs, but EXPLAIN shows that the index is used. Besides, the table is quite large but the query is still reasonably fast.

Could you please explain this behavior?

MySQL stores its indexes in B-Tree data structures.

The Wikipedia article on B-trees explains the structure quite well, so I won't repeat it. I'd rather draw a picture similar to the one in the article:

This picture is quite self-explanatory. The records are stored in tree order, so to search for a certain value, say 11, you start from the first page and find the link to follow. To do this, you find the pair of adjacent values that are less than and greater than 11; in this case, you follow the link between 8 and 12. Then you search for the next pair, and so on, until you either find your value or reach the end and know that the value is not there.
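The descent described above can be sketched in a few lines of Python. This is only an illustration, not how MySQL implements it: the `Page` class and the `find` function are names I made up, and the tree holds the keys mentioned in this article (1, 4, 8, 11 to 15) plus a few filled in to complete the pages. The position within a page is located with `bisect_left` from the standard library; how the engine really searches within a page is discussed further down.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Page:
    """One index page: sorted keys plus links to child pages (none for a leaf)."""
    keys: list
    children: list = field(default_factory=list)


def find(page, value):
    """Return (found, path), where path lists the link positions followed."""
    path = []
    while True:
        # Position of the first key that is not less than the value.
        pos = bisect_left(page.keys, value)
        if pos < len(page.keys) and page.keys[pos] == value:
            return True, path
        if not page.children:
            return False, path      # reached a leaf: the value is not there
        # Link number pos lies between keys[pos - 1] and keys[pos].
        path.append(pos)
        page = page.children[pos]


# Hypothetical tree: some of the leaf keys are filled in for illustration.
root = Page(
    keys=[4, 8, 12],
    children=[Page([1, 2, 3]), Page([5, 6, 7]),
              Page([9, 10, 11]), Page([13, 14, 15])],
)

print(find(root, 11))   # follows link 2, the one between 8 and 12
print(find(root, 16))   # follows the last link and reports a miss
```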

Following the links is quite simple, but how does the engine search for the values within one page?

This depends on how you declared the table.

MyISAM supports two algorithms for storing the index keys in a page: packed keys and unpacked keys.

Unpacked keys are what you see in the picture above: each page just stores the key values and the links to the pages down the tree. This is very simple.

Packed keys are designed to improve performance on character data. Many words and phrases, especially those that are close to each other, start with the same sequence of characters.

If you are going to store track names like:

The Man Who Sold The World

The Man Who Invented Himself

The Man Who Has Everything

MyISAM can save storage space by storing them like this:

(The Man Who) Sold The World

(×11) Invented Himself

(×11) Has Everything

This is called key compression: instead of repeating the whole key value for each record, MyISAM stores the longest common prefix once and prefixes each subsequent record that shares it with the length of that prefix. This makes the keys shorter and the index more compact.
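Here is a minimal sketch of the idea in Python. It is not the MyISAM on-disk format: the `pack` and `unpack` functions are hypothetical, and I assume that each key is compressed against the key right before it in sorted order, which is one straightforward way to do prefix compression. The sketch counts the trailing space as part of the shared prefix, so it reports a length of 12 for these titles.

```python
def pack(keys):
    """Turn sorted keys into (shared_length, suffix) records."""
    packed, previous = [], ""
    for key in keys:
        # Count the leading characters shared with the previous key.
        shared = 0
        limit = min(len(key), len(previous))
        while shared < limit and key[shared] == previous[shared]:
            shared += 1
        packed.append((shared, key[shared:]))
        previous = key
    return packed


def unpack(packed):
    """Rebuild the full keys; each record needs the key before it."""
    keys, previous = [], ""
    for shared, suffix in packed:
        previous = previous[:shared] + suffix
        keys.append(previous)
    return keys


titles = sorted([
    "The Man Who Sold The World",
    "The Man Who Invented Himself",
    "The Man Who Has Everything",
])

for record in pack(titles):
    print(record)
```

Only the first record carries the whole title; the other two carry a length and the part that differs. Note that `unpack` cannot rebuild a record without walking through all the records before it, which is what matters for the search.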

However, this affects the index search time.

With an unpacked index, a binary search is applied to find the keys within each level page.

With a packed index, this won't work: you need to know the value of two keys to compare them and not every record contains full information about the key.

So in the case of a packed index, MySQL remembers the value of the prefix and iterates over the records one by one. This is less efficient than a binary search, but since many more records fit on one page, the number of page traversals is kept to a minimum and overall efficiency increases.
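To make the difference concrete, here is a rough Python sketch of both searches within a single page. The record layout, a list of (shared length, suffix) pairs, and the function names are my own assumptions for the sake of the example, and the keys are made up.

```python
from bisect import bisect_left


def search_unpacked(keys, value):
    """Binary search: every key is stored in full, so we can jump around."""
    pos = bisect_left(keys, value)
    return pos < len(keys) and keys[pos] == value


def search_packed(packed, value):
    """Linear scan: return (found, number of keys rebuilt on the way)."""
    previous = ""
    for rebuilt, (shared, suffix) in enumerate(packed, start=1):
        key = previous[:shared] + suffix   # needs the previous key
        if key == value:
            return True, rebuilt
        if key > value:
            return False, rebuilt          # keys are sorted, so stop here
        previous = key
    return False, len(packed)


# One hypothetical page holding ABBA, ABLE, MANIA, MANTRA, ZEBRA, ZERO.
unpacked_page = ["ABBA", "ABLE", "MANIA", "MANTRA", "ZEBRA", "ZERO"]
packed_page = [(0, "ABBA"), (2, "LE"), (0, "MANIA"),
               (3, "TRA"), (0, "ZEBRA"), (2, "RO")]

print(search_unpacked(unpacked_page, "ZERO"))
print(search_packed(packed_page, "ABBA"))
print(search_packed(packed_page, "ZERO"))
```

The packed search has to rebuild one key to find ABBA but all six to find ZERO, while the binary search needs at most about log2(6) steps wherever the key sits.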

But the records on one page still need to be compared and searched. And with a linear search, keys with smaller values tend to require fewer iterations than those with greater values.

Let's look at the picture above again. The keys are searched left to right.

To search for key 1, we only need two comparisons: compare to 4, get the next page, compare to 1.

But to search for key 15, we need to compare to 4, 8, 12 then get to the next page and compare to 13, 14 and finally to 15.

This is 6 operations compared to 2 required to fetch the first key.
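The same count can be reproduced with a short Python sketch. The tree below is an assumption: its root and its first and last leaves follow the keys mentioned above, and the two middle leaves are filled in for illustration. A page is a plain (keys, children) tuple, and one pass through the loop body counts as one comparison.

```python
def comparisons(page, value):
    """Count key comparisons for a left-to-right search, page by page."""
    count = 0
    while page is not None:
        keys, children = page
        pos = 0
        for key in keys:
            count += 1                 # one comparison against this key
            if key == value:
                return count           # found it
            if key > value:
                break                  # the link to follow is before this key
            pos += 1
        # Follow the link at pos, or stop if this page is a leaf.
        page = children[pos] if children else None
    return count


# Hypothetical tree: root 4, 8, 12 with four leaves below it.
root = ([4, 8, 12], [
    ([1, 2, 3], []),
    ([5, 6, 7], []),
    ([9, 10, 11], []),
    ([13, 14, 15], []),
])

print(comparisons(root, 1))    # 2
print(comparisons(root, 15))   # 6
```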

Now, let's create the sample tables and see some figures:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_source (
id INT NOT NULL PRIMARY KEY
) ENGINE=MyISAM;

CREATE TABLE t_packed (
id INT NOT NULL PRIMARY KEY,
name CHAR(6) NOT NULL,
KEY ix_packed_name (name)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;

CREATE TABLE t_unpacked (
id INT NOT NULL PRIMARY KEY,
name CHAR(6) NOT NULL,
KEY ix_unpacked_name (name)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=0;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(50000);
COMMIT;

INSERT
INTO    t_source
SELECT  id
FROM    filler;

INSERT
INTO    t_packed
SELECT  id,
        (
        SELECT  GROUP_CONCAT(CHAR(65 + FLOOR(RAND(20100204) * 26)) SEPARATOR '')
        FROM    (
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                ) q
        )
FROM    filler;

INSERT
INTO    t_unpacked
SELECT  *
FROM    t_packed;

There are two MyISAM tables with randomly generated character sequences like this:

id	name
1	RTUPPH
2	RKZQJW
3	FKMEKL
4	BYOZFE
5	GSTRAF
6	YBNMSG
7	ZEZKCE
8	PMPNUJ
9	OQMMYH
10	OYAFDZ

49999	DZMKRC
50000	NHYWLR

The structure of the tables is identical, except that t_packed packs keys and t_unpacked does not.

Unpacked keys

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'ABCDEF', NULL)

COUNT(*)
50000
1 row fetched in 0.0001s (0.7656s)

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'GHIJKL', NULL)

COUNT(*)
50000
1 row fetched in 0.0001s (0.7331s)

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'NOPQRS', NULL)

COUNT(*)
50000
1 row fetched in 0.0001s (0.7656s)

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_unpacked p
ON      p.name = IF(s.id > -1, _UTF8'ZYXWVU', NULL)

COUNT(*)
50000
1 row fetched in 0.0001s (0.7712s)

The queries against the strings beginning with A, G, N and Z take the same time. All queries have been run several times to populate the cache and the execution times are consistent.

The LEFT JOIN against a non-existent value was used in the query to avoid stopping on a found key and make the query traverse the index as much as possible. We also put a formal dependency on s.id here so that t_source is always leading in the join and no const optimizations are performed.

Packed keys

Let's try the same queries on packed keys:

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'ABCDEF', NULL)

COUNT(*)
50000
1 row fetched in 0.0001s (1.9531s)

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'GHIJKL', NULL)

COUNT(*)
50000
1 row fetched in 0.0001s (2.3593s)

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'NOPQRS', NULL)

COUNT(*)
50000
1 row fetched in 0.0002s (2.7812s)

SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
t_packed p
ON      p.name = IF(s.id > -1, _UTF8'ZYXWVU', NULL)

COUNT(*)
50000
1 row fetched in 0.0002s (2.9375s)

The query against the string beginning with the letter Z takes more than 50% more time than the query against a string beginning with A.

Summary

Here's a little summary table:

Search string	Unpacked key, s	Relative to ABCDEF, %	Packed key, s	Relative to ABCDEF, %
ABCDEF	0.7656	100.00	1.9531	100.00
GHIJKL	0.7331	95.75	2.3593	120.79
NOPQRS	0.7656	100.00	2.7812	142.39
ZYXWVU	0.7712	100.73	2.9375	150.40

We see that the value being searched for does not affect the time needed to search an index with unpacked keys, but seriously affects the performance of an index with packed keys.
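The relative figures can be recomputed from the raw timings with a few lines of Python. The `TIMES` dictionary and the `relative` function are just names I picked for this sketch; the numbers are the execution times measured above, and the results are rounded to one decimal place.

```python
# Execution times in seconds, copied from the measurements in this article.
TIMES = {
    "ABCDEF": {"unpacked": 0.7656, "packed": 1.9531},
    "GHIJKL": {"unpacked": 0.7331, "packed": 2.3593},
    "NOPQRS": {"unpacked": 0.7656, "packed": 2.7812},
    "ZYXWVU": {"unpacked": 0.7712, "packed": 2.9375},
}


def relative(kind, baseline="ABCDEF"):
    """Express each time as a percentage of the baseline search string."""
    base = TIMES[baseline][kind]
    return {name: round(100 * row[kind] / base, 1)
            for name, row in TIMES.items()}


print(relative("unpacked"))
print(relative("packed"))
```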

This increase is due to linear search used to locate the records within a single page.

Note that not every pair of adjacent records in a page (with either packed or unpacked keys) has a corresponding lower-level page containing the intermediate values. There can be leaves and branches at the same depth in the tree.

This can lead to some artifacts: certain strings can be found (or proved absent) faster than others. However, on average, all records have the same depth. And on average, with packed indexes, the need for a linear search increases the time required to find keys with greater values.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 4th, 2010 at 11:00 pm

Posted in MySQL
