Greatest N per group: dealing with aggregates
Answering questions asked on the site.
Vlad Enache asks:
In MySQL I have a table called `meanings` with three columns:

person | word | meaning
---|---|---
1 | 1 | 4
1 | 2 | 19
1 | 2 | 7
1 | 3 | 5

`word` has 16 possible values, `meaning` has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.
There will be thousands of persons.
I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:
- word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
- word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)
etc.
Is it possible to solve this with a single MySQL query?
This task is a typical greatest-n-per-group problem.
Earlier I described some solutions to it: one using session variables, and another one using `LIMIT` in a subquery. However, those solutions assume that the records are taken from a single table, while this task needs to retrieve the three greatest aggregates, not the three greatest records. Mixing session variables with `JOIN`s is not recommended, and the `LIMIT` solution would be inefficient here, since aggregates cannot be indexed.
Some databases, namely PostgreSQL, used to exploit array functionality for this task (before window functions were introduced in 8.4). Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and `GROUP_CONCAT`.
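Here is a minimal standalone sketch of the idea (illustrative only, with made-up values, not part of the final solution): `GROUP_CONCAT` packs a group into an ordered, comma-separated string, and `SUBSTRING_INDEX` then slices elements off it, much like indexing into an array:

```sql
-- Emulating "take the first array element" with string functions:
-- GROUP_CONCAT builds an ordered, comma-separated string per group,
-- SUBSTRING_INDEX then cuts the first element off.
SELECT  grp,
        SUBSTRING_INDEX(GROUP_CONCAT(val ORDER BY val DESC), ',', 1) AS top_val
FROM    (
        SELECT 1 AS grp, 10 AS val
        UNION ALL
        SELECT 1, 30
        UNION ALL
        SELECT 2, 20
        ) q
GROUP BY grp;
-- grp 1 yields '30', grp 2 yields '20'
```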
Let's create a sample table. It contains 1,000,000 records for 32,500 people, each giving two random meanings to each of the 16 words. All three fields comprise the `PRIMARY KEY`, and there is an additional index on `(word, meaning)`.
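A minimal sketch of the table definition, assuming the names that appear in the query and plan below (`t_meaning`, `ix_meaning_word_meaning`); the script that fills it with the 1,000,000 random rows is omitted:

```sql
CREATE TABLE t_meaning (
        person INT NOT NULL,
        word INT NOT NULL,
        meaning INT NOT NULL,
        -- all three fields comprise the PRIMARY KEY
        PRIMARY KEY (person, word, meaning),
        -- secondary index that serves the GROUP BY on (word, meaning)
        KEY ix_meaning_word_meaning (word, meaning)
);
```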
Here's what we need to do to build the query:
- Calculate the count of records for each meaning of each word, using `GROUP BY`. This will use the secondary index on `(word, meaning)` that we created above, avoiding a sort (`Using filesort`) or materialization (`Using temporary`).
- `GROUP_CONCAT` the meaning and its count, concatenated together, from the previous recordset, ordering the values by count in descending order. Note that theoretically this could produce very long strings which would not fit into `@@group_concat_max_len` and would get truncated. But we can use it nevertheless: first, there are only 26 possible meanings per word; second, of those 26 meanings we only need the first three.
- Join the resultset with a dummy table containing 3 records. To do this, we will just concatenate 3 one-record `SELECT` statements using `UNION ALL`. This is not a very elegant solution, since it does not allow parameterizing `n` for the query, but MySQL does not support recordset generation functions like `generate_series`. Alternatively, one could build a dummy table and use `LIMIT` to select the top values from it. But for the sake of simplicity, I will go with the `UNION ALL` solution.
- Extract the individual values from the concatenated string using `SUBSTRING_INDEX` (see the sketch after this list).
- Calculate the ratios, using the number of distinct persons as the total. This way, the ratio for each meaning of a word is computed correctly, though the sum of the ratios can exceed 100% (since one person can give more than one meaning to a word).
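Two details from the list above deserve a standalone illustration (a sketch with made-up values, independent of the main query): the `@@group_concat_max_len` limit, and the nested `SUBSTRING_INDEX` calls that emulate array indexing:

```sql
-- If longer concatenated lists were possible, the session limit could be
-- raised first (the default group_concat_max_len is 1024 bytes):
SET SESSION group_concat_max_len = 4096;

-- Nested SUBSTRING_INDEX calls pick a single element out of a list:
-- the inner call keeps the last lvl elements, the outer call keeps the
-- first of those, i.e. element number (4 - lvl) of the top three.
SELECT  SUBSTRING_INDEX(SUBSTRING_INDEX('5:350,19:220,2:130', ',', -3), ',', 1) AS pos_1, -- '5:350'
        SUBSTRING_INDEX(SUBSTRING_INDEX('5:350,19:220,2:130', ',', -2), ',', 1) AS pos_2, -- '19:220'
        SUBSTRING_INDEX(SUBSTRING_INDEX('5:350,19:220,2:130', ',', -1), ',', 1) AS pos_3; -- '2:130'
```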
And here's the query:
```sql
SELECT  word, 4 - lvl AS position,
        SUBSTRING_INDEX(rec, ':', 1) AS meaning,
        SUBSTRING_INDEX(rec, ':', -1) / total AS rate
FROM    (
        SELECT  word, total,
                SUBSTRING_INDEX(SUBSTRING_INDEX(tops, ',', -lvl), ',', 1) AS rec,
                lvl
        FROM    (
                SELECT  word, total,
                        SUBSTRING_INDEX(ms, ',', 3) AS tops,
                        lvl
                FROM    (
                        SELECT  word,
                                (
                                SELECT  COUNT(*)
                                FROM    (
                                        SELECT  DISTINCT person
                                        FROM    t_meaning
                                        ) dp
                                ) AS total,
                                CAST(GROUP_CONCAT(CONCAT_WS(':', meaning, cnt) ORDER BY cnt DESC, meaning DESC) AS CHAR) AS ms
                        FROM    (
                                SELECT  word, meaning, COUNT(*) AS cnt
                                FROM    t_meaning
                                GROUP BY
                                        word, meaning
                                ) q
                        GROUP BY
                                word
                        ) q2
                CROSS JOIN
                        (
                        SELECT  1 AS lvl
                        UNION ALL
                        SELECT  2 AS lvl
                        UNION ALL
                        SELECT  3 AS lvl
                        ) fields
                ) q3
        ) q4
ORDER BY
        word, position
```
word | position | meaning | rate
---|---|---|---
1 | 1 | 16 | 0.079456
1 | 2 | 17 | 0.079296
1 | 3 | 12 | 0.079264
2 | 1 | 9 | 0.079584
2 | 2 | 10 | 0.079264
2 | 3 | 20 | 0.078656
… | … | … | …
16 | 1 | 14 | 0.080064
16 | 2 | 18 | 0.079424
16 | 3 | 4 | 0.078976
48 rows fetched in 0.0017s (1.1094s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
---|---|---|---|---|---|---|---|---|---|---
1 | PRIMARY | &lt;derived2&gt; | ALL | | | | | 48 | 100.00 | Using filesort
2 | DERIVED | &lt;derived3&gt; | ALL | | | | | 48 | 100.00 |
3 | DERIVED | &lt;derived8&gt; | ALL | | | | | 3 | 100.00 |
3 | DERIVED | &lt;derived4&gt; | ALL | | | | | 16 | 100.00 | Using join buffer
8 | DERIVED | | | | | | | | | No tables used
9 | UNION | | | | | | | | | No tables used
10 | UNION | | | | | | | | | No tables used
 | UNION RESULT | &lt;union8,9,10&gt; | ALL | | | | | | |
4 | DERIVED | &lt;derived7&gt; | ALL | | | | | 416 | 100.00 | Using filesort
7 | DERIVED | t_meaning | index | | ix_meaning_word_meaning | 8 | | 1000443 | 100.00 | Using index
5 | SUBQUERY | | | | | | | | | Select tables optimized away
6 | DERIVED | t_meaning | range | | PRIMARY | 4 | | 58850 | 100.00 | Using index for group-by
select `q4`.`word` AS `word`,(4 - `q4`.`lvl`) AS `position`,substring_index(`q4`.`rec`,':',1) AS `meaning`,(substring_index(`q4`.`rec`,':',-(1)) / `q4`.`total`) AS `rate` from (select `q3`.`word` AS `word`,`q3`.`total` AS `total`,substring_index(substring_index(`q3`.`tops`,',',-(`q3`.`lvl`)),',',1) AS `rec`,`q3`.`lvl` AS `lvl` from (select `q2`.`word` AS `word`,`q2`.`total` AS `total`,substring_index(`q2`.`ms`,',',3) AS `tops`,`fields`.`lvl` AS `lvl` from (select `q`.`word` AS `word`,(select count(0) AS `COUNT(*)` from (select distinct `20100318_word`.`t_meaning`.`person` AS `person` from `20100318_word`.`t_meaning`) `dp`) AS `total`,cast(group_concat(concat_ws(':',`q`.`meaning`,`q`.`cnt`) order by `q`.`cnt` DESC,`q`.`meaning` DESC separator ',') as char charset latin1) AS `ms` from (select `20100318_word`.`t_meaning`.`word` AS `word`,`20100318_word`.`t_meaning`.`meaning` AS `meaning`,count(0) AS `cnt` from `20100318_word`.`t_meaning` group by `20100318_word`.`t_meaning`.`word`,`20100318_word`.`t_meaning`.`meaning`) `q` group by `q`.`word`) `q2` join (select 1 AS `lvl` union all select 2 AS `lvl` union all select 3 AS `lvl`) `fields`) `q3`) `q4` order by `q4`.`word`,(4 - `q4`.`lvl`)
Note that we used
SELECT COUNT(*) FROM ( SELECT DISTINCT person FROM t_meaning ) q
instead of the simpler
SELECT COUNT(DISTINCT person) FROM t_meaning
The latter looks more elegant, but MySQL is not able to use `INDEX FOR GROUP BY` for it. That's why we used the former subquery to calculate the totals; the query plan above shows `Using index for group-by`, confirming that the optimization kicks in.
The overall query time is 1.10 sec, which is just a little more than the time required to iterate over all records in the table.
Hope that helps.
I'm always glad to answer questions regarding database queries.