EXPLAIN EXTENDED

How to create fast database queries

MySQL: DISTINCT on DISTINCT

Comments enabled. I *really* need your comment

From Stack Overflow:

I currently have the following query:

SELECT  group_concat (DISTINCT usrFirst, usrLast) as receiver_name
FROM    …

It outputs my list of names as follows:

JohnDoe,BillSmith,DaveJones

, while desired ouput would be:

John Doe, Bill Smith, and Dave Jones

Basically, I need help figuring out three things:

  1. How can I put a space, between the first and last name?
  2. How can insert a space after each comma?
  3. How can I add an and right before the last name displayed?

This kind of output should be really processed on the client (since comma and the and word are very cultire specific).

From MySQL, we just need a recordset of distinct full names.

Let's create a sample table and see how do we do it:

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_person (
id INT NOT NULL PRIMARY KEY,
lastname VARCHAR(50) NOT NULL,
firstname VARCHAR(50) NOT NULL,
stuffing VARCHAR(100) NOT NULL,
KEY ix_person_lastname_firstname(lastname, firstname)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_person (id, lastname, firstname, stuffing)
SELECT  id,
        ELT(FLOOR(RAND(20090801) * 3) + 1, 'Adams', 'Smith', 'Davis Smith'),
        ELT(FLOOR(RAND(20090801 << 1) * 3) + 1, 'Alice', 'Jon', 'Jon Davis'),
        RPAD(CONCAT_WS(' ', 'User ', id), 100, ' ')
FROM    filler;

To get a list of distinct names, we could just do the following:

SELECT  DISTINCT CONCAT_WS(' ', firstname, lastname)
FROM    t_person

CONCAT_WS(' ', firstname, lastname)
Alice Adams
Jon Adams
Jon Davis Adams
Alice Davis Smith
Jon Davis Smith
Jon Davis Davis Smith
Alice Smith
Jon Smith
8 rows fetched in 0.0002s (0.4826s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_person index ix_person_lastname_firstname 304 100359 100.00 Using index; Using temporary

This works, but takes almost 500 ms, not that good.

From the query plan, we see that a temporary table is built to parse DISTINCT.

This is because there is no single index on MySQL that covers CONCAT_WS(' ', firstname, lastname).

However, we can select distinct values of the fields themselves and concatenate them after that:

SELECT  CONCAT_WS(' ', firstname, lastname)
FROM    (
SELECT  DISTINCT firstname, lastname
FROM    t_person
) q

CONCAT_WS(' ', firstname, lastname)
Alice Adams
Jon Adams
Jon Davis Adams
Alice Davis Smith
Jon Davis Smith
Jon Davis Davis Smith
Alice Smith
Jon Smith
Jon Davis Smith
9 rows fetched in 0.0002s (0.0019s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 9 100.00
2 DERIVED t_person range ix_person_lastname_firstname 304 87 100.00 Using index for group-by; Using temporary

This is much faster, almost instant.

However, we see that the latter query returns Jon Davis Smith twice.

This is because they are different people: one mr. Smith whose first name is John Davis, and one mr. Davis Smith whose first name is Jon.

Returning them twice may be in fact intentional (they are different people, after all).

However, if will just need to build a list of joined names that are distinct, we just can repeat the DISTINCT clause in the outer query:

SELECT  DISTINCT CONCAT_WS(' ', firstname, lastname)
FROM    (
SELECT  DISTINCT firstname, lastname
FROM    t_person
) q

CONCAT_WS(' ', firstname, lastname)
Alice Adams
Jon Adams
Jon Davis Adams
Alice Davis Smith
Jon Davis Smith
Jon Davis Davis Smith
Alice Smith
Jon Smith
8 rows fetched in 0.0002s (0.0016s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 9 100.00 Using temporary
2 DERIVED t_person range ix_person_lastname_firstname 304 87 100.00 Using index for group-by; Using temporary

This returns but 8 rows and Jon Davis Smith is returned only once. And despite using temporary in the plan, it's as fast as the first query, since temporary table is built to sieve 9 values, not 100,000.

Repeating DISTINCT clause may seem to be redundant.

However, MySQL optimizer cannot see the opportunity to use index for group-by if DISTINCT is applied to a complex expression.

By duplicating DISTINCT clause, we help MySQL to see this opportunity.

Written by Quassnoi

August 1st, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply