Greatest N per group: dealing with aggregates
Answering questions asked on the site.
Vlad Enache asks:
In MySQL I have a table called `meanings` with three columns:

person | word | meaning
---|---|---
1 | 1 | 4
1 | 2 | 19
1 | 2 | 7
1 | 3 | 5

`word` has 16 possible values, `meaning` has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.
There will be thousands of persons.
I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:
- word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
- word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)
etc.
Is it possible to solve this with a single MySQL query?
This task is a typical greatest-n-per-group problem.
Earlier I described some solutions to it: one using session variables, and another one using `LIMIT` in a subquery. However, those solutions assume that the records are taken from a single table, while this task needs to retrieve the three greatest aggregates, not the three greatest records. Mixing session variables with `JOIN`s is not recommended, and the `LIMIT` solution would be inefficient here, since aggregates cannot be indexed.
Some databases, namely PostgreSQL, used to exploit array functionality for this task (before window functions were introduced in 8.4). Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and `GROUP_CONCAT`.
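Here is a minimal standalone sketch of the idea (illustrative only, with made-up values, not part of the final solution): `GROUP_CONCAT` packs a group into an ordered, comma-separated string, and `SUBSTRING_INDEX` then slices elements off it, much like indexing into an array:

```sql
-- Emulating "take the first array element" with string functions:
-- GROUP_CONCAT builds an ordered, comma-separated string per group,
-- SUBSTRING_INDEX then cuts the first element off.
SELECT  grp,
        SUBSTRING_INDEX(GROUP_CONCAT(val ORDER BY val DESC), ',', 1) AS top_val
FROM    (
        SELECT 1 AS grp, 10 AS val
        UNION ALL
        SELECT 1, 30
        UNION ALL
        SELECT 2, 20
        ) q
GROUP BY grp;
-- grp 1 yields '30', grp 2 yields '20'
```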
Let's create a sample table. It contains 1,000,000 records for 32,500 people, each giving two random meanings to each of the 16 words. All three fields comprise the `PRIMARY KEY`, and there is an additional index on `(word, meaning)`.
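A minimal sketch of the table definition, assuming the names that appear in the query and plan below (`t_meaning`, `ix_meaning_word_meaning`); the script that fills it with the 1,000,000 random rows is omitted:

```sql
CREATE TABLE t_meaning (
        person INT NOT NULL,
        word INT NOT NULL,
        meaning INT NOT NULL,
        -- all three fields comprise the PRIMARY KEY
        PRIMARY KEY (person, word, meaning),
        -- secondary index that serves the GROUP BY on (word, meaning)
        KEY ix_meaning_word_meaning (word, meaning)
);
```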
Here's what we need to do to build the query:
- Calculate the count of records for each meaning of each word, using `GROUP BY`. This will use the secondary index on `(word, meaning)` that we created above, avoiding a sort (`Using filesort`) or materialization (`Using temporary`).
- `GROUP_CONCAT` the meaning and its count, concatenated together, from the previous recordset, ordering the values by count in descending order. Note that theoretically this could produce very long strings which would not fit into `@@group_concat_max_len` and would get truncated. But we can use it nevertheless: first, there are only 26 possible meanings per word; second, of those 26 meanings we only need the first three.
- Join the resultset with a dummy table containing 3 records. To do this, we will just concatenate 3 one-record `SELECT` statements using `UNION ALL`. This is not a very elegant solution, since it does not allow parameterizing `n` for the query, but MySQL does not support recordset generation functions like `generate_series`. Alternatively, one could build a dummy table and use `LIMIT` to select the top values from it. But for the sake of simplicity, I will go with the `UNION ALL` solution.
- Extract the individual values from the concatenated string using `SUBSTRING_INDEX` (see the sketch after this list).
- Calculate the ratios, using the number of distinct persons as the total. This way, the ratio for each meaning of a word is computed correctly, though the sum of the ratios can exceed 100% (since one person can give more than one meaning to a word).
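Two details from the list above deserve a standalone illustration (a sketch with made-up values, independent of the main query): the `@@group_concat_max_len` limit, and the nested `SUBSTRING_INDEX` calls that emulate array indexing:

```sql
-- If longer concatenated lists were possible, the session limit could be
-- raised first (the default group_concat_max_len is 1024 bytes):
SET SESSION group_concat_max_len = 4096;

-- Nested SUBSTRING_INDEX calls pick a single element out of a list:
-- the inner call keeps the last lvl elements, the outer call keeps the
-- first of those, i.e. element number (4 - lvl) of the top three.
SELECT  SUBSTRING_INDEX(SUBSTRING_INDEX('5:350,19:220,2:130', ',', -3), ',', 1) AS pos_1, -- '5:350'
        SUBSTRING_INDEX(SUBSTRING_INDEX('5:350,19:220,2:130', ',', -2), ',', 1) AS pos_2, -- '19:220'
        SUBSTRING_INDEX(SUBSTRING_INDEX('5:350,19:220,2:130', ',', -1), ',', 1) AS pos_3; -- '2:130'
```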
And here's the query:
```sql
SELECT  word, 4 - lvl AS position,
        SUBSTRING_INDEX(rec, ':', 1) AS meaning,
        SUBSTRING_INDEX(rec, ':', -1) / total AS rate
FROM    (
        SELECT  word, total,
                SUBSTRING_INDEX(SUBSTRING_INDEX(tops, ',', -lvl), ',', 1) AS rec,
                lvl
        FROM    (
                SELECT  word, total,
                        SUBSTRING_INDEX(ms, ',', 3) AS tops,
                        lvl
                FROM    (
                        SELECT  word,
                                (
                                SELECT  COUNT(*)
                                FROM    (
                                        SELECT  DISTINCT person
                                        FROM    t_meaning
                                        ) dp
                                ) AS total,
                                CAST(GROUP_CONCAT(CONCAT_WS(':', meaning, cnt) ORDER BY cnt DESC, meaning DESC) AS CHAR) AS ms
                        FROM    (
                                SELECT  word, meaning, COUNT(*) AS cnt
                                FROM    t_meaning
                                GROUP BY
                                        word, meaning
                                ) q
                        GROUP BY
                                word
                        ) q2
                CROSS JOIN
                        (
                        SELECT  1 AS lvl
                        UNION ALL
                        SELECT  2 AS lvl
                        UNION ALL
                        SELECT  3 AS lvl
                        ) fields
                ) q3
        ) q4
ORDER BY
        word, position
```
word | position | meaning | rate
---|---|---|---
1 | 1 | 16 | 0.079456
1 | 2 | 17 | 0.079296
1 | 3 | 12 | 0.079264
2 | 1 | 9 | 0.079584
2 | 2 | 10 | 0.079264
2 | 3 | 20 | 0.078656
… | … | … | …
16 | 1 | 14 | 0.080064
16 | 2 | 18 | 0.079424
16 | 3 | 4 | 0.078976
48 rows fetched in 0.0017s (1.1094s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
---|---|---|---|---|---|---|---|---|---|---
1 | PRIMARY | &lt;derived2&gt; | ALL | | | | | 48 | 100.00 | Using filesort
2 | DERIVED | &lt;derived3&gt; | ALL | | | | | 48 | 100.00 |
3 | DERIVED | &lt;derived8&gt; | ALL | | | | | 3 | 100.00 |
3 | DERIVED | &lt;derived4&gt; | ALL | | | | | 16 | 100.00 | Using join buffer
8 | DERIVED | | | | | | | | | No tables used
9 | UNION | | | | | | | | | No tables used
10 | UNION | | | | | | | | | No tables used
 | UNION RESULT | &lt;union8,9,10&gt; | ALL | | | | | | |
4 | DERIVED | &lt;derived7&gt; | ALL | | | | | 416 | 100.00 | Using filesort
7 | DERIVED | t_meaning | index | | ix_meaning_word_meaning | 8 | | 1000443 | 100.00 | Using index
5 | SUBQUERY | | | | | | | | | Select tables optimized away
6 | DERIVED | t_meaning | range | | PRIMARY | 4 | | 58850 | 100.00 | Using index for group-by
select `q4`.`word` AS `word`,(4 - `q4`.`lvl`) AS `position`,substring_index(`q4`.`rec`,':',1) AS `meaning`,(substring_index(`q4`.`rec`,':',-(1)) / `q4`.`total`) AS `rate` from (select `q3`.`word` AS `word`,`q3`.`total` AS `total`,substring_index(substring_index(`q3`.`tops`,',',-(`q3`.`lvl`)),',',1) AS `rec`,`q3`.`lvl` AS `lvl` from (select `q2`.`word` AS `word`,`q2`.`total` AS `total`,substring_index(`q2`.`ms`,',',3) AS `tops`,`fields`.`lvl` AS `lvl` from (select `q`.`word` AS `word`,(select count(0) AS `COUNT(*)` from (select distinct `20100318_word`.`t_meaning`.`person` AS `person` from `20100318_word`.`t_meaning`) `dp`) AS `total`,cast(group_concat(concat_ws(':',`q`.`meaning`,`q`.`cnt`) order by `q`.`cnt` DESC,`q`.`meaning` DESC separator ',') as char charset latin1) AS `ms` from (select `20100318_word`.`t_meaning`.`word` AS `word`,`20100318_word`.`t_meaning`.`meaning` AS `meaning`,count(0) AS `cnt` from `20100318_word`.`t_meaning` group by `20100318_word`.`t_meaning`.`word`,`20100318_word`.`t_meaning`.`meaning`) `q` group by `q`.`word`) `q2` join (select 1 AS `lvl` union all select 2 AS `lvl` union all select 3 AS `lvl`) `fields`) `q3`) `q4` order by `q4`.`word`,(4 - `q4`.`lvl`)
Note that we used
SELECT COUNT(*) FROM ( SELECT DISTINCT person FROM t_meaning ) q
instead of the simpler
SELECT COUNT(DISTINCT person) FROM t_meaning
The latter looks more elegant, but MySQL is not able to use `INDEX FOR GROUP BY` for it. That's why we used the former subquery to calculate the totals; the query plan above shows `Using index for group-by`, confirming that the optimization kicks in.
The overall query time is 1.10 sec, which is just a little more than the time required to iterate over all records in the table.
Hope that helps.
I'm always glad to answer questions regarding database queries.