EXPLAIN EXTENDED

How to create fast database queries

Archive for March 18th, 2010

Greatest N per group: dealing with aggregates

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Vlad Enache asks:

In MySQL I have a table called meanings with three columns:

meanings
person word meaning
1 1 4
1 2 19
1 2 7
1 3 5

word has 16 possible values, meaning has 26.

A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.

There will be thousands of persons.

I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:

  • word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
  • word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)

etc.

Is it possible to solve this with a single MySQL query?

This task is a typical greatest-n-per-group problem.

Earlier I described some solutions to it, one using session variables, and another one using LIMIT in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs, and using the LIMIT solution would be inefficient, since the aggregates can not be indexed.

Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).

Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 18th, 2010 at 11:00 pm

Posted in MySQL