MySQL: selecting a list of the first letters at EXPLAIN EXTENDED

I would like to produce a character list of all of the first letters of column in my database.

Is there a way to do this in MySQL?

Let's create a sample table of 1,000,000 records and fill it with random data:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_names (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(50) NOT NULL,
KEY ix_names_name (name)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

SET @l_vowels = 'aeiouy';
SET @u_vowels = UPPER(@l_vowels);
SET @l_consonants = 'bcdfghjklmnpqrstvwxz';
SET @u_consonants = UPPER(@l_consonants);

INSERT
INTO    t_names(id, name)
SELECT  id,
        (
        SELECT  GROUP_CONCAT(
                IF(
                (slen - id) % 2 = 0,
                SUBSTR(IF(l.id = 1, @u_vowels, @l_vowels), FLOOR(RAND(20090508 &lt;&lt; 2) * LENGTH(@l_vowels) + 1), 1), 
                SUBSTR(IF(l.id = 1, @u_consonants, @l_consonants), FLOOR(RAND(20090508 &lt;&lt; 3) * LENGTH(@l_consonants) + 1), 1)
                )
                ORDER BY l.id
                SEPARATOR '')
        FROM    (
                SELECT  *
                FROM    filler
                LIMIT 20
                ) l
        WHERE   l.id &lt;= slen
        )
FROM    (
        SELECT  id,
                RAND(20090508) AS rnd,
                FLOOR(RAND(20090508 &lt;&lt; 1) * 10) + 4 AS slen
        FROM    filler
        ) q;

Now, there is a nice simple query to do this:

SELECT  DISTINCT(SUBSTRING(name, 1, 1))
FROM    t_names

(SUBSTRING(name, 1, 1))
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
26 rows fetched in 0.0004s (1.6406s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_names	index		ix_names_name	152		997252	100.00	Using index; Using temporary

This query, however, cannot use the indexes and, thus, is quite iinefficient: it runs for 1600 ms.

To make it more efficient, we need to uses indexes. Actually, we just need to jump from the first index key to key starting with a letter next to the given, and so on. Thus, will will visit only one index leaf per letter.

Unfortunately, MySQL does not allow using indexes on condtions like SUBSTRING(name, 1, 1), despite the fact that only first letters matter in this case.

That's why we'll need to emulate this behavior.

The query we are going to use will perform the following steps:

The query will use the table itself (t_names) as a dummy rowsource (i. e. only for generating a set of rows, ignoring the values)
A session variable will be assigned with a code of the first letter of current MIN(name)
On each row in the dummy rowset, the variable will be assigned with a new value: the code of the letter next to the current one in the table.
As soon nothing more is returned, the query should be stopped

Due to the bug in MySQL, range access method is never being chosen for a correlated subquery if the range condition comes from the outside.

To work around this, we will need to create a function, which, given a code of a letter, returns the code of the letter next to the given one.

Here's the function:

CREATE FUNCTION fn_get_next_code(initial INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _next VARCHAR(200);
DECLARE EXIT HANDLER FOR NOT FOUND RETURN NULL;
SELECT  ORD(SUBSTRING(name, 1, 1))
INTO    _next
FROM    t_names
WHERE   name &gt;= CHAR(initial + 1)
ORDER BY
name
LIMIT 1;
RETURN _next;
END

, and here's the query:

SELECT  current
FROM    (
SELECT  CHAR(@r) AS current,
@r := fn_get_next_code(@r)
FROM    (
SELECT  @r := ORD(SUBSTRING(MIN(name), 1, 1))
FROM    t_names
) vars,
(
SELECT  1
FROM  t_names
LIMIT 100
) q
WHERE   @r IS NOT NULL
) qo

current
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
26 rows fetched in 0.0004s (0.0049s)

id	select_type	table	type	key	key_len	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL			26	100.00
2	DERIVED	<derived3>	system			1	100.00
2	DERIVED	<derived4>	ALL			100	100.00	Using where
4	DERIVED	t_names	index	PRIMARY	4	997252	100.00	Using index
3	DERIVED							Select tables optimized away

The query is now instant (4 ms).

This, of course, pays for itself only when choosing first value a very large dataset (similar to one described above). On smaller dataset, DISTINCT is almost instant too.

But since DISTINCT uses filesort which is quite memory- and CPU-intensive, this approach may be useful even for smaller datasets in a highly-concurrent environment.

Written by Quassnoi

May 8th, 2009 at 11:30 pm

Posted in MySQL

One Response to 'Selecting first letters'

Subscribe to comments with RSS

Very nice indeed! Helped me alot! :)

DAVEologist

1 Feb 13 at 16:36

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

EXPLAIN EXTENDED

Selecting first letters

One Response to 'Selecting first letters'

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Selecting first letters

Share this:

One Response to 'Selecting first letters'

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow