Bayesian classification
Comments enabled. I *really* need your comment
From Stack Overflow:
Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.
I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.
Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.
The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.
What's the best way to calculate this?
This is a classical application of Bayes' Theorem.
The formula to calculate the posterior probability is:
P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))
, where:
P(A|B)
is the posterior probability of the visitor being a male (given that he visited the site)P(A)
is the prior probability of the visitor being a male (initially, 50%)P(B)
is the probability of (any Internet user) visiting the siteP(B|A)
is the probability of a user visiting the site, given that he is a maleP(A*)
is the prior probability of the visitor not being a male (initially, 50%)P(B|A*)
is the probability of a user visiting the site, given that she is not a male.
Since a user can only be male or female:
P(A|B) = P(B|A)×P(A)/P(B) = P(B|A)×P(A) / (P(B|A)×P(A) + (1 - P(B|A))×(1 - P(A)))
P(B|A)
is the number stored in the database (probability of the user being a male).
We consider the events of visiting the different sites to be independent (a fact that the user visited site A neither influences nor is influenced by the fact that the user also visited site B. This is of course not so, since the sites may exchange links etc., but we make this assumption for the sake of simplicity.
So, given a series of the sites, we take the initial probability (P0 = 0.5
) and recursively substitute it into the following formula:
Pn = Sn×Pn-1 / (Sn×Pn-1 + (1 - Sn)×(1 - Pn-1))
Simple calculations show us the the recursion (which MySQL is not good at) can be replaced with an aggregate formula:
P = P0 * PROD(S) / (P0 * PROD(S) + (1 - P0) * PROD(1 - S))
, where PROD
is the aggregate product of the sites' probabilities of their visitors being a male.
SQL does not have a built-in aggregate product function, but it can be easily replaced by the aggregate sum on the logarithmic scale:
P = P0 * EXP(SUM(LN(S))) / (P0 * EXP(SUM(LN(S))) + (1 - P0) * EXP(SUM(LN(1 - S))))
Given all that, let's create some sample tables:
There are 1,000 users and 1,000 sites. The sites are assigned with maleness
from 0.03 to 0.97.
User randomly visit the sites according to their gender and the site gender distribution. There are 25 visits per user in average.
Let's try to guess the users' gender and return only wrong guesses.
We will assume that the user is male when the posterior probability of the user being male is more than 0.99, female if that is less than 0.01, undefined if within 0.01 and 0.99:
SELECT *, CASE WHEN posterior < 0.01 THEN 'F' WHEN posterior > 0.99 THEN 'M' ELSE 'U' END AS guessed FROM ( SELECT u.*, prior * EXP(SUM(LN(male))) / (prior * EXP(SUM(LN(male))) + (1 - prior) * EXP(SUM(LN(1 - male)))) AS posterior FROM ( SELECT 0.5 AS prior ) vars CROSS JOIN t_user u LEFT JOIN t_visit v ON v.u_id = u.id LEFT JOIN t_site s ON s.id = v.s_id GROUP BY u.id ) q HAVING guessed <> gender
id | name | gender | posterior | guessed | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
51 | User 51 | F | 0.652234131074669 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
53 | User 53 | F | 0.87625067361204 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
94 | User 94 | M | 0.732238662361337 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
264 | User 264 | F | 0.0520209347475727 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
475 | User 475 | M | 0.974230285094509 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
497 | User 497 | M | 0.966568719694869 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
542 | User 542 | F | 0.0685609699288645 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
595 | User 595 | M | 0.984478426560255 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
742 | User 742 | F | 0.0334681988009631 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
768 | User 768 | M | 0.960799229888108 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
800 | User 800 | F | 0.0181411777994256 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
867 | User 867 | F | 0.0401728770664721 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
882 | User 882 | M | 0.884671868426923 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
902 | User 902 | F | 0.802525467489821 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
14 rows fetched in 0.0006s (0.1989s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 1000 | 100.00 | |||||
2 | DERIVED | <derived3> | system | 1 | 100.00 | Using temporary; Using filesort | ||||
2 | DERIVED | u | ALL | 871 | 100.00 | |||||
2 | DERIVED | v | ref | PRIMARY | PRIMARY | 4 | 20100325_bayes.u.id | 12 | 100.00 | Using index |
2 | DERIVED | s | eq_ref | PRIMARY | PRIMARY | 4 | 20100325_bayes.v.s_id | 1 | 100.00 | |
3 | DERIVED | No tables used |
select `q`.`id` AS `id`,`q`.`name` AS `name`,`q`.`gender` AS `gender`,`q`.`posterior` AS `posterior`,(case when (`q`.`posterior` < 0.01) then 'F' when (`q`.`posterior` > 0.99) then 'M' else 'U' end) AS `guessed` from (select `20100325_bayes`.`u`.`id` AS `id`,`20100325_bayes`.`u`.`name` AS `name`,`20100325_bayes`.`u`.`gender` AS `gender`,(('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) / (('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) + ((1 - '0.5') * exp(sum(ln((1 - `20100325_bayes`.`s`.`male`))))))) AS `posterior` from (select 0.5 AS `prior`) `vars` join `20100325_bayes`.`t_user` `u` left join `20100325_bayes`.`t_visit` `v` on((`20100325_bayes`.`v`.`u_id` = `20100325_bayes`.`u`.`id`)) left join `20100325_bayes`.`t_site` `s` on((`20100325_bayes`.`s`.`id` = `20100325_bayes`.`v`.`s_id`)) where 1 group by `20100325_bayes`.`u`.`id`) `q` having (convert(`guessed` using utf8) <> `q`.`gender`)
From 1,000 users, we only have 14 results outside the credible interval of 99% and all of these are undefined rather than false.