Archive for March 25th, 2010
Bayesian classification
From Stack Overflow:
Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.
I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.
Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.
The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.
What's the best way to calculate this?
This is a classic application of Bayes' Theorem.
The formula to calculate the posterior probability is:
P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))
where:
P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
P(A) is the prior probability of the visitor being a male (initially, 50%)
P(B) is the probability of (any Internet user) visiting the site
P(B|A) is the probability of a user visiting the site, given that he is a male
P(A*) is the prior probability of the visitor not being a male (initially, 50%)
P(B|A*) is the probability of a user visiting the site, given that she is not a male.
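To make the formula concrete, here is a minimal Python sketch of a single Bayesian update. The function name `posterior` and the visit probabilities used in the example calls are assumptions chosen for illustration; they do not come from the question.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B), expanding P(B) over the two hypotheses A and A*."""
    prior_not_a = 1.0 - prior_a  # P(A*)
    # P(B) = P(B|A) * P(A) + P(B|A*) * P(A*)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * prior_not_a
    if p_b == 0:
        raise ValueError("P(B) is zero: nobody visits the site under either hypothesis")
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: males visit the site twice as often as non-males,
# so the 50% prior moves to roughly two thirds.
print(posterior(0.5, 0.02, 0.01))

# A site that non-males never visit (like S0 in the question)
# makes the posterior certain, whatever the prior was.
print(posterior(0.5, 0.02, 0.0))
```

When both groups are equally likely to visit, the posterior equals the prior, so such a site carries no information about the visitor.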