EXPLAIN EXTENDED

How to create fast database queries

Archive for March 25th, 2010

Bayesian classification


From Stack Overflow:

Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) do not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))

where:

  • P(A|B) is the posterior probability of the visitor being male (given that he visited the site)
  • P(A) is the prior probability of the visitor being male (initially, 50%)
  • P(B) is the probability of any Internet user visiting the site
  • P(B|A) is the probability of a user visiting the site, given that he is male
  • P(A*) is the prior probability of the visitor not being male (initially, 50%)
  • P(B|A*) is the probability of a user visiting the site, given that she is not male.
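
The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the method used in the rest of the post: the function name `posterior`, its parameter names, and the 30% and 10% visit rates are all hypothetical.

```python
def posterior(prior: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) from P(A), P(B|A) and P(B|A*) using Bayes' theorem."""
    # P(A*) is the complement of the prior P(A).
    prior_not_a = 1.0 - prior
    # P(B), expanded as P(B|A) * P(A) + P(B|A*) * P(A*).
    evidence = p_b_given_a * prior + p_b_given_not_a * prior_not_a
    if evidence == 0:
        raise ValueError("P(B) is zero: nobody visits the site under these inputs")
    return p_b_given_a * prior / evidence


# Hypothetical numbers, not taken from the post: a site visited by
# 30% of men and 10% of women, starting from the 50% prior.
p_male = posterior(prior=0.5, p_b_given_a=0.3, p_b_given_not_a=0.1)
print(round(p_male, 4))  # prints 0.75
```

Feeding the result back in as the prior for the next site is one way to combine several visits, but that treats the visits as independent given the visitor's gender, which is an assumption of this sketch.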


Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL