How to create fast database queries

Bayesian classification

Comments enabled. I *really* need your comment

From Stack Overflow:

Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))

, where:

  • P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
  • P(A) is the prior probability of the visitor being a male (initially, 50%)
  • P(B) is the probability of (any Internet user) visiting the site
  • P(B|A) is the probability of a user visiting the site, given that he is a male
  • P(A*) is the prior probability of the visitor not being a male (initially, 50%)
  • P(B|A*) is the probability of a user visiting the site, given that she is not a male.

Since a user can only be male or female:

P(A|B) = P(B|A)×P(A)/P(B) = P(B|A)×P(A) / (P(B|A)×P(A) + (1 - P(B|A))×(1 - P(A)))

P(B|A) is the number stored in the database (probability of the user being a male).

We consider the events of visiting the different sites to be independent (a fact that the user visited site A neither influences nor is influenced by the fact that the user also visited site B. This is of course not so, since the sites may exchange links etc., but we make this assumption for the sake of simplicity.

So, given a series of the sites, we take the initial probability (P0 = 0.5) and recursively substitute it into the following formula:

Pn = Sn×Pn-1 / (Sn×Pn-1 + (1 - Sn)×(1 - Pn-1))

Simple calculations show us the the recursion (which MySQL is not good at) can be replaced with an aggregate formula:

P = P0 * PROD(S) / (P0 * PROD(S) + (1 - P0) * PROD(1 - S))

, where PROD is the aggregate product of the sites' probabilities of their visitors being a male.

SQL does not have a built-in aggregate product function, but it can be easily replaced by the aggregate sum on the logarithmic scale:

P = P0 * EXP(SUM(LN(S))) / (P0 * EXP(SUM(LN(S))) + (1 - P0) * EXP(SUM(LN(1 - S))))

Given all that, let's create some sample tables:

Table creation details

There are 1,000 users and 1,000 sites. The sites are assigned with maleness from 0.03 to 0.97.

User randomly visit the sites according to their gender and the site gender distribution. There are 25 visits per user in average.

Let's try to guess the users' gender and return only wrong guesses.

We will assume that the user is male when the posterior probability of the user being male is more than 0.99, female if that is less than 0.01, undefined if within 0.01 and 0.99:

SELECT  *, CASE WHEN posterior < 0.01 THEN 'F' WHEN posterior > 0.99 THEN 'M' ELSE 'U' END AS guessed
FROM    (
SELECT  u.*,
prior * EXP(SUM(LN(male))) / (prior * EXP(SUM(LN(male))) + (1 - prior) * EXP(SUM(LN(1 - male)))) AS posterior
FROM    (
SELECT  0.5 AS prior
) vars
t_user u
t_visit v
ON      v.u_id = u.id
t_site s
ON      s.id = v.s_id
) q
HAVING  guessed <> gender
id name gender posterior guessed
51 User 51 F 0.652234131074669 U
53 User 53 F 0.87625067361204 U
94 User 94 M 0.732238662361337 U
264 User 264 F 0.0520209347475727 U
475 User 475 M 0.974230285094509 U
497 User 497 M 0.966568719694869 U
542 User 542 F 0.0685609699288645 U
595 User 595 M 0.984478426560255 U
742 User 742 F 0.0334681988009631 U
768 User 768 M 0.960799229888108 U
800 User 800 F 0.0181411777994256 U
867 User 867 F 0.0401728770664721 U
882 User 882 M 0.884671868426923 U
902 User 902 F 0.802525467489821 U
14 rows fetched in 0.0006s (0.1989s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 1000 100.00
2 DERIVED <derived3> system 1 100.00 Using temporary; Using filesort
2 DERIVED u ALL 871 100.00
2 DERIVED v ref PRIMARY PRIMARY 4 20100325_bayes.u.id 12 100.00 Using index
2 DERIVED s eq_ref PRIMARY PRIMARY 4 20100325_bayes.v.s_id 1 100.00
3 DERIVED No tables used
select `q`.`id` AS `id`,`q`.`name` AS `name`,`q`.`gender` AS `gender`,`q`.`posterior` AS `posterior`,(case when (`q`.`posterior` < 0.01) then 'F' when (`q`.`posterior` > 0.99) then 'M' else 'U' end) AS `guessed` from (select `20100325_bayes`.`u`.`id` AS `id`,`20100325_bayes`.`u`.`name` AS `name`,`20100325_bayes`.`u`.`gender` AS `gender`,(('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) / (('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) + ((1 - '0.5') * exp(sum(ln((1 - `20100325_bayes`.`s`.`male`))))))) AS `posterior` from (select 0.5 AS `prior`) `vars` join `20100325_bayes`.`t_user` `u` left join `20100325_bayes`.`t_visit` `v` on((`20100325_bayes`.`v`.`u_id` = `20100325_bayes`.`u`.`id`)) left join `20100325_bayes`.`t_site` `s` on((`20100325_bayes`.`s`.`id` = `20100325_bayes`.`v`.`s_id`)) where 1 group by `20100325_bayes`.`u`.`id`) `q` having (convert(`guessed` using utf8) <> `q`.`gender`)

From 1,000 users, we only have 14 results outside the credible interval of 99% and all of these are undefined rather than false.

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL

Leave a Reply