EXPLAIN EXTENDED

How to create fast database queries

Selecting first letters

with one comment

From Stack Overflow:

I would like to produce a character list of all of the first letters of column in my database.

Is there a way to do this in MySQL?

Let's create a sample table of 1,000,000 records and fill it with random data:

Table creation details

Now, there is a nice simple query to do this:

SELECT  DISTINCT(SUBSTRING(name, 1, 1))
FROM    t_names

(SUBSTRING(name, 1, 1))
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
26 rows fetched in 0.0004s (1.6406s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_names index ix_names_name 152 997252 100.00 Using index; Using temporary

This query, however, cannot use the indexes and, thus, is quite iinefficient: it runs for 1600 ms.

To make it more efficient, we need to uses indexes. Actually, we just need to jump from the first index key to key starting with a letter next to the given, and so on. Thus, will will visit only one index leaf per letter.

Unfortunately, MySQL does not allow using indexes on condtions like SUBSTRING(name, 1, 1), despite the fact that only first letters matter in this case.

That's why we'll need to emulate this behavior.

The query we are going to use will perform the following steps:

  1. The query will use the table itself (t_names) as a dummy rowsource (i. e. only for generating a set of rows, ignoring the values)
  2. A session variable will be assigned with a code of the first letter of current MIN(name)
  3. On each row in the dummy rowset, the variable will be assigned with a new value: the code of the letter next to the current one in the table.
  4. As soon nothing more is returned, the query should be stopped

Due to the bug in MySQL, range access method is never being chosen for a correlated subquery if the range condition comes from the outside.

To work around this, we will need to create a function, which, given a code of a letter, returns the code of the letter next to the given one.

Here's the function:

CREATE FUNCTION fn_get_next_code(initial INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _next VARCHAR(200);
DECLARE EXIT HANDLER FOR NOT FOUND RETURN NULL;
SELECT  ORD(SUBSTRING(name, 1, 1))
INTO    _next
FROM    t_names
WHERE   name >= CHAR(initial + 1)
ORDER BY
name
LIMIT 1;
RETURN _next;
END

, and here's the query:

SELECT  current
FROM    (
SELECT  CHAR(@r) AS current,
@r := fn_get_next_code(@r)
FROM    (
SELECT  @r := ORD(SUBSTRING(MIN(name), 1, 1))
FROM    t_names
) vars,
(
SELECT  1
FROM  t_names
LIMIT 100
) q
WHERE   @r IS NOT NULL
) qo
current
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
26 rows fetched in 0.0004s (0.0049s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 26 100.00
2 DERIVED <derived3> system 1 100.00
2 DERIVED <derived4> ALL 100 100.00 Using where
4 DERIVED t_names index PRIMARY 4 997252 100.00 Using index
3 DERIVED Select tables optimized away

The query is now instant (4 ms).

This, of course, pays for itself only when choosing first value a very large dataset (similar to one described above). On smaller dataset, DISTINCT is almost instant too.

But since DISTINCT uses filesort which is quite memory- and CPU-intensive, this approach may be useful even for smaller datasets in a highly-concurrent environment.

Written by Quassnoi

May 8th, 2009 at 11:30 pm

Posted in MySQL

One Response to 'Selecting first letters'

Subscribe to comments with RSS

  1. Very nice indeed! Helped me alot! :)

    DAVEologist

    1 Feb 13 at 16:36

Leave a Reply