## Bayesian classification

Comments enabled. I *really* need your comment

From **Stack Overflow**:

Suppose you've visited sites

S0 … S50. All exceptS0are48%female;S0is100%male.I'm guessing your gender, and I want to have a value close to

100%, not just the49%that a straight average would give.Also, consider that most demographics (i.e. everything other than gender) does not have the average at

50%. For example, the average probability of having kids0-17is~37%.The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

`P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A`

^{*}) × P(A^{*}))

, where:

`P(A|B)`

is the posterior probability of the visitor being a male (given that he visited the site)`P(A)`

is the prior probability of the visitor being a male (initially,**50%**)`P(B)`

is the probability of (any Internet user) visiting the site`P(B|A)`

is the probability of a user visiting the site, given that he is a male`P(A`

is the prior probability of the visitor not being a male (initially,^{*})**50%**)`P(B|A`

is the probability of a user visiting the site, given that she is not a male.^{*})

Since a user can only be male or female:

`P(A|B) = P(B|A)×P(A)/P(B) = P(B|A)×P(A) / (P(B|A)×P(A) + (1 - P(B|A))×(1 - P(A)))`

`P(B|A)`

is the number stored in the database (probability of the user being a male).

We consider the events of visiting the different sites to be independent (a fact that the user visited site **A** neither influences nor is influenced by the fact that the user also visited site **B**. This is of course not so, since the sites may exchange links etc., but we make this assumption for the sake of simplicity.

So, given a series of the sites, we take the initial probability (`P`

) and recursively substitute it into the following formula:_{0} = 0.5

`P`

_{n} = S_{n}×P_{n-1} / (S_{n}×P_{n-1} + (1 - S_{n})×(1 - P_{n-1}))

Simple calculations show us the the recursion (which **MySQL** is not good at) can be replaced with an aggregate formula:

`P = P0 * PROD(S) / (P0 * PROD(S) + (1 - P0) * PROD(1 - S))`

, where `PROD`

is the aggregate product of the sites' probabilities of their visitors being a male.

**SQL** does not have a built-in aggregate product function, but it can be easily replaced by the aggregate sum on the logarithmic scale:

`P = P0 * EXP(SUM(LN(S))) / (P0 * EXP(SUM(LN(S))) + (1 - P0) * EXP(SUM(LN(1 - S))))`

Given all that, let's create some sample tables:

There are **1,000** users and **1,000** sites. The sites are assigned with maleness

from **0.03** to **0.97**.

User randomly visit the sites according to their gender and the site gender distribution. There are **25** visits per user in average.

Let's try to guess the users' gender and return only wrong guesses.

We will assume that the user is male when the posterior probability of the user being male is more than **0.99**, female if that is less than **0.01**, undefined if within **0.01** and **0.99**:

SELECT *, CASE WHEN posterior < 0.01 THEN 'F' WHEN posterior > 0.99 THEN 'M' ELSE 'U' END AS guessed FROM ( SELECT u.*, prior * EXP(SUM(LN(male))) / (prior * EXP(SUM(LN(male))) + (1 - prior) * EXP(SUM(LN(1 - male)))) AS posterior FROM ( SELECT 0.5 AS prior ) vars CROSS JOIN t_user u LEFT JOIN t_visit v ON v.u_id = u.id LEFT JOIN t_site s ON s.id = v.s_id GROUP BY u.id ) q HAVING guessed <> gender

id | name | gender | posterior | guessed | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

51 | User 51 | F | 0.652234131074669 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

53 | User 53 | F | 0.87625067361204 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

94 | User 94 | M | 0.732238662361337 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

264 | User 264 | F | 0.0520209347475727 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

475 | User 475 | M | 0.974230285094509 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

497 | User 497 | M | 0.966568719694869 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

542 | User 542 | F | 0.0685609699288645 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

595 | User 595 | M | 0.984478426560255 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

742 | User 742 | F | 0.0334681988009631 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

768 | User 768 | M | 0.960799229888108 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

800 | User 800 | F | 0.0181411777994256 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

867 | User 867 | F | 0.0401728770664721 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

882 | User 882 | M | 0.884671868426923 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

902 | User 902 | F | 0.802525467489821 | U | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

14 rows fetched in 0.0006s (0.1989s) |

id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|

1 | PRIMARY | <derived2> | ALL | 1000 | 100.00 | |||||

2 | DERIVED | <derived3> | system | 1 | 100.00 | Using temporary; Using filesort | ||||

2 | DERIVED | u | ALL | 871 | 100.00 | |||||

2 | DERIVED | v | ref | PRIMARY | PRIMARY | 4 | 20100325_bayes.u.id | 12 | 100.00 | Using index |

2 | DERIVED | s | eq_ref | PRIMARY | PRIMARY | 4 | 20100325_bayes.v.s_id | 1 | 100.00 | |

3 | DERIVED | No tables used |

select `q`.`id` AS `id`,`q`.`name` AS `name`,`q`.`gender` AS `gender`,`q`.`posterior` AS `posterior`,(case when (`q`.`posterior` < 0.01) then 'F' when (`q`.`posterior` > 0.99) then 'M' else 'U' end) AS `guessed` from (select `20100325_bayes`.`u`.`id` AS `id`,`20100325_bayes`.`u`.`name` AS `name`,`20100325_bayes`.`u`.`gender` AS `gender`,(('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) / (('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) + ((1 - '0.5') * exp(sum(ln((1 - `20100325_bayes`.`s`.`male`))))))) AS `posterior` from (select 0.5 AS `prior`) `vars` join `20100325_bayes`.`t_user` `u` left join `20100325_bayes`.`t_visit` `v` on((`20100325_bayes`.`v`.`u_id` = `20100325_bayes`.`u`.`id`)) left join `20100325_bayes`.`t_site` `s` on((`20100325_bayes`.`s`.`id` = `20100325_bayes`.`v`.`s_id`)) where 1 group by `20100325_bayes`.`u`.`id`) `q` having (convert(`guessed` using utf8) <> `q`.`gender`)

From **1,000** users, we only have **14** results outside the credible interval of **99%** and all of these are undefined rather than false.