How to create fast database queries

Greatest N per group: dealing with aggregates


Answering questions asked on the site.

Vlad Enache asks:

In MySQL I have a table called meanings with three columns:

person  word  meaning
     1     1        4
     1     2       19
     1     2        7
     1     3        5

word has 16 possible values, meaning has 26.

A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.

There will be thousands of persons.

I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:

  • word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
  • word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)


Is it possible to solve this with a single MySQL query?

This task is a typical greatest-n-per-group problem.

Earlier I described some solutions to this problem: one using session variables, and another one using LIMIT in a subquery. However, those solutions assume that the records are taken from a single table, while this task needs to retrieve the three greatest aggregates, not the three greatest records. Mixing session variables with JOINs is not recommended, and the LIMIT solution would be inefficient here, since aggregates cannot be indexed.

Some databases, namely PostgreSQL, could exploit array functionality for this task (before window functions were introduced in 8.4).

Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT.
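As a toy illustration of the technique (the values and table here are made up, not the actual query yet): GROUP_CONCAT packs a group into an ordered, comma-separated string, and SUBSTRING_INDEX cuts elements back out of it, array-style.

```sql
-- Pack a "group" of values into an ordered string, then take
-- the first element. Runs standalone in MySQL.
SELECT  GROUP_CONCAT(val ORDER BY val DESC) AS packed,   -- '30,20,10'
        SUBSTRING_INDEX(GROUP_CONCAT(val ORDER BY val DESC), ',', 1) AS top_val  -- '30'
FROM    (
        SELECT  10 AS val
        UNION ALL
        SELECT  20
        UNION ALL
        SELECT  30
        ) t;
```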

Let's create a sample table:

Table creation details

The table contains 1,000,000 records for 31,250 people, each giving two random meanings to each of the 16 words.

All three fields comprise a PRIMARY KEY, and there is an additional index on (word, meaning).
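The exact DDL is in the collapsed section above; a plausible definition matching this description (the column types are assumptions, though the index name ix_meaning_word_meaning does appear in the query plan below) would be:

```sql
-- Assumed table definition: all three columns form the PRIMARY KEY,
-- plus a secondary index on (word, meaning)
CREATE TABLE t_meaning (
        person INT NOT NULL,
        word INT NOT NULL,
        meaning INT NOT NULL,
        PRIMARY KEY (person, word, meaning),
        KEY ix_meaning_word_meaning (word, meaning)
) ENGINE=InnoDB;
```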

Here's what we need to do to build the query:

  1. Calculate the count of records with each meaning for each word, using GROUP BY.

    This will use the secondary index on (word, meaning) that we created above to avoid sorting (using filesort) or materialization (using temporary).

  2. GROUP_CONCAT the concatenated meaning and count from the previous recordset, ordering the values by count in descending order.

    Note that, in theory, this could produce very large strings that would exceed @@group_concat_max_len and get truncated. But we can use it nevertheless: first, there are only 26 possible meanings per word; second, of those 26 meanings we only need the first three.

  3. Join the resultset with a dummy table containing 3 records. To do this, we will just combine 3 one-record SELECT statements using UNION ALL.

    This is not a very elegant solution, since it does not allow parameterizing n in the query, but MySQL does not support recordset-generating functions like generate_series.

    Alternatively, one could build a dummy table and use a LIMIT to select top values from it. But for the sake of simplicity, I will go with the UNION ALL solution.

  4. Extract the individual values from the concatenated string using SUBSTRING_INDEX.
  5. Calculate the ratios, using the number of distinct persons as a total.

    This way, the ratio for each word's meaning will be computed correctly, though the ratios can sum to more than 100% (since one person can give more than one meaning to a word).
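Taken alone, step 1 is just the innermost aggregate of the query below, and it can be served entirely from the secondary index:

```sql
-- Per-(word, meaning) counts; the index on (word, meaning)
-- makes this an index-only scan with no filesort
SELECT  word, meaning, COUNT(*) AS cnt
FROM    t_meaning
GROUP BY
        word, meaning;
```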

And here's the query:

SELECT  word, 4 - lvl AS position,
        SUBSTRING_INDEX(rec, ':', 1) AS meaning,
        SUBSTRING_INDEX(rec, ':', -1) / total AS rate
FROM    (
        SELECT  word, total,
                SUBSTRING_INDEX(SUBSTRING_INDEX(tops, ',', -lvl), ',', 1) AS rec,
                lvl
        FROM    (
                SELECT  word, total,
                        SUBSTRING_INDEX(ms, ',', 3) AS tops,
                        lvl
                FROM    (
                        SELECT  word,
                                (
                                SELECT  COUNT(*)
                                FROM    (
                                        SELECT  DISTINCT person
                                        FROM    t_meaning
                                        ) dp
                                ) AS total,
                                CAST(GROUP_CONCAT(CONCAT_WS(':', meaning, cnt) ORDER BY cnt DESC, meaning DESC) AS CHAR) AS ms
                        FROM    (
                                SELECT  word, meaning, COUNT(*) AS cnt
                                FROM    t_meaning
                                GROUP BY
                                        word, meaning
                                ) q
                        GROUP BY
                                word
                        ) q2
                CROSS JOIN
                        (
                        SELECT  1 AS lvl
                        UNION ALL
                        SELECT  2 AS lvl
                        UNION ALL
                        SELECT  3 AS lvl
                        ) fields
                ) q3
        ) q4
ORDER BY
        word, position
word  position  meaning  rate
   1         1       16  0.079456
   1         2       17  0.079296
   1         3       12  0.079264
   2         1        9  0.079584
   2         2       10  0.079264
   2         3       20  0.078656
   …
  16         1       14  0.080064
  16         2       18  0.079424
  16         3        4  0.078976
48 rows fetched in 0.0017s (1.1094s)
id | select_type  | table         | type  | key                     | key_len | rows    | filtered | Extra
 1 | PRIMARY      | <derived2>    | ALL   |                         |         | 48      | 100.00   | Using filesort
 2 | DERIVED      | <derived3>    | ALL   |                         |         | 48      | 100.00   |
 3 | DERIVED      | <derived8>    | ALL   |                         |         | 3       | 100.00   |
 3 | DERIVED      | <derived4>    | ALL   |                         |         | 16      | 100.00   | Using join buffer
 8 | DERIVED      |               |       |                         |         |         |          | No tables used
 9 | UNION        |               |       |                         |         |         |          | No tables used
10 | UNION        |               |       |                         |         |         |          | No tables used
   | UNION RESULT | <union8,9,10> | ALL   |                         |         |         |          |
 4 | DERIVED      | <derived7>    | ALL   |                         |         | 416     | 100.00   | Using filesort
 7 | DERIVED      | t_meaning     | index | ix_meaning_word_meaning | 8       | 1000443 | 100.00   | Using index
 5 | SUBQUERY     |               |       |                         |         |         |          | Select tables optimized away
 6 | DERIVED      | t_meaning     | range | PRIMARY                 | 4       | 58850   | 100.00   | Using index for group-by
select `q4`.`word` AS `word`,(4 - `q4`.`lvl`) AS `position`,substring_index(`q4`.`rec`,':',1) AS `meaning`,(substring_index(`q4`.`rec`,':',-(1)) / `q4`.`total`) AS `rate` from (select `q3`.`word` AS `word`,`q3`.`total` AS `total`,substring_index(substring_index(`q3`.`tops`,',',-(`q3`.`lvl`)),',',1) AS `rec`,`q3`.`lvl` AS `lvl` from (select `q2`.`word` AS `word`,`q2`.`total` AS `total`,substring_index(`q2`.`ms`,',',3) AS `tops`,`fields`.`lvl` AS `lvl` from (select `q`.`word` AS `word`,(select count(0) AS `COUNT(*)` from (select distinct `20100318_word`.`t_meaning`.`person` AS `person` from `20100318_word`.`t_meaning`) `dp`) AS `total`,cast(group_concat(concat_ws(':',`q`.`meaning`,`q`.`cnt`) order by `q`.`cnt` DESC,`q`.`meaning` DESC separator ',') as char charset latin1) AS `ms` from (select `20100318_word`.`t_meaning`.`word` AS `word`,`20100318_word`.`t_meaning`.`meaning` AS `meaning`,count(0) AS `cnt` from `20100318_word`.`t_meaning` group by `20100318_word`.`t_meaning`.`word`,`20100318_word`.`t_meaning`.`meaning`) `q` group by `q`.`word`) `q2` join (select 1 AS `lvl` union all select 2 AS `lvl` union all select 3 AS `lvl`) `fields`) `q3`) `q4` order by `q4`.`word`,(4 - `q4`.`lvl`)

Note that we used

(
SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT person
        FROM    t_meaning
        ) dp
)

instead of more simple

SELECT  COUNT(DISTINCT person)
FROM    t_meaning

The latter looks more elegant, but MySQL cannot use the INDEX FOR GROUP BY for it.

That's why we used the former subquery to calculate the total, and the query plan above confirms that the INDEX FOR GROUP BY is being used.

The overall query time is 1.10 sec, which is just a little more than the time required to iterate over all records in the table.

Hope that helps.

I'm always glad to answer questions regarding database queries.

Ask me a question

Written by Quassnoi

March 18th, 2010 at 11:00 pm

Posted in MySQL
