Answering questions asked on the site.
Vlad Enache asks:
In MySQL I have a table called
meaningswith three columns:
meanings person word meaning 1 1 4 1 2 19 1 2 7 1 3 5
wordhas 16 possible values,
A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.
There will be thousands of persons.
I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:
- word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
- word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)
Is it possible to solve this with a single MySQL query?
This task is a typical greatest-n-per-group problem.
Earlier I described some solutions to it, one using session variables, and another one using
LIMIT in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the
JOINs, and using the
LIMIT solution would be inefficient, since the aggregates can not be indexed.
Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).
Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and
Let’s create a sample table:
Table creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_meaning ( person INT NOT NULL, word INT NOT NULL, meaning INT NOT NULL, PRIMARY KEY (person, word, meaning), KEY ix_meaning_word_meaning (word, meaning) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_meaning (person, word, meaning) SELECT (id - 1) div 32 + 1, (id - 1) % 16 + 1, ((((id - 1) div 16) % 2) * 13) + CEILING(RAND(20100318 << 1) * 13) FROM filler;
The table contains 1,000,000 records for 32,500 people, each giving two random meanings to each of the 16 words.
All three fields comprise a
PRIMARY KEY, and there is an additional index on
Here’s what we need to do to build the query:
Calculate the count of records with each meaning for each word, using
This will use the secondary index on
(word, meaning)that we created above to avoid sorting (
using filesort) or materialization (
GROUP_CONCATthe concatenated meaning and count from the previous recordset, ordering the values by count in descending order.
Note that theoretically this could result in very large strings which would not fit into
@@group_concat_max_lenand get truncated. But we can use it nevertheless: first, there are only 26 meanings per word possible, second, of these 26 meanings we only the first three.
Join the resultset with the dummy table containing 3 records. To do this, we will just concatenate 3 one-record
This is not a very elegant solution, since it does not allow parameterizing the
nfor the query, but MySQL does not support dummy recordset generation functions like
Alternatively, one could build a dummy table and use a
LIMITto select top values from it. But for the sake of simplicity, I will go with the
- Extract the individual values from the concatenated string using
Calculate the ratios, using the number of distinct persons as a total.
This way, ratio of each word’s meaning will be computed correctly, though the sum of ratios can exceed 100% (since one person can give more than one meaning to a word).
And here’s the query:
SELECT word, 4 - lvl AS position, SUBSTRING_INDEX(rec, ':', 1) AS meaning, SUBSTRING_INDEX(rec, ':', -1) / total AS rate FROM ( SELECT word, total, SUBSTRING_INDEX(SUBSTRING_INDEX(tops, ',', -lvl), ',', 1) AS rec, lvl FROM ( SELECT word, total, SUBSTRING_INDEX(ms, ',', 3) AS tops, lvl FROM ( SELECT word, ( SELECT COUNT(*) FROM ( SELECT DISTINCT person FROM t_meaning ) dp ) AS total, CAST(GROUP_CONCAT(CONCAT_WS(':', meaning, cnt) ORDER BY cnt DESC, meaning DESC) AS CHAR) AS ms FROM ( SELECT word, meaning, COUNT(*) AS cnt FROM t_meaning GROUP BY word, meaning ) q GROUP BY word ) q2 CROSS JOIN ( SELECT 1 AS lvl UNION ALL SELECT 2 AS lvl UNION ALL SELECT 3 AS lvl ) fields ) q3 ) q4 ORDER BY word, position
|48 rows fetched in 0.0017s (1.1094s)|
|3||DERIVED||<derived4>||ALL||16||100.00||Using join buffer|
|8||DERIVED||No tables used|
|9||UNION||No tables used|
|10||UNION||No tables used|
|5||SUBQUERY||Select tables optimized away|
|6||DERIVED||t_meaning||range||PRIMARY||4||58850||100.00||Using index for group-by|
select `q4`.`word` AS `word`,(4 - `q4`.`lvl`) AS `position`,substring_index(`q4`.`rec`,':',1) AS `meaning`,(substring_index(`q4`.`rec`,':',-(1)) / `q4`.`total`) AS `rate` from (select `q3`.`word` AS `word`,`q3`.`total` AS `total`,substring_index(substring_index(`q3`.`tops`,',',-(`q3`.`lvl`)),',',1) AS `rec`,`q3`.`lvl` AS `lvl` from (select `q2`.`word` AS `word`,`q2`.`total` AS `total`,substring_index(`q2`.`ms`,',',3) AS `tops`,`fields`.`lvl` AS `lvl` from (select `q`.`word` AS `word`,(select count(0) AS `COUNT(*)` from (select distinct `20100318_word`.`t_meaning`.`person` AS `person` from `20100318_word`.`t_meaning`) `dp`) AS `total`,cast(group_concat(concat_ws(':',`q`.`meaning`,`q`.`cnt`) order by `q`.`cnt` DESC,`q`.`meaning` DESC separator ',') as char charset latin1) AS `ms` from (select `20100318_word`.`t_meaning`.`word` AS `word`,`20100318_word`.`t_meaning`.`meaning` AS `meaning`,count(0) AS `cnt` from `20100318_word`.`t_meaning` group by `20100318_word`.`t_meaning`.`word`,`20100318_word`.`t_meaning`.`meaning`) `q` group by `q`.`word`) `q2` join (select 1 AS `lvl` union all select 2 AS `lvl` union all select 3 AS `lvl`) `fields`) `q3`) `q4` order by `q4`.`word`,(4 - `q4`.`lvl`)
Note that we used
SELECT COUNT(*) FROM ( SELECT DISTINCT person FROM t_meaning ) q
instead of more simple
SELECT COUNT(DISTINCT person) FROM t_meaning
The latter looks more elegant, but MySQL is not able to use
INDEX FOR GROUP BY for it.
That’s why we used the former subquery to calculate the totals, and we can see in the query plan that the
INDEX FOR GROUP BY is being used.
The overall query time is 1.10 sec which is just a little more than the time required to iterate all records in the table.
Hope that helps.
I’m always glad to answer the questions regarding database queries.