# EXPLAIN EXTENDED

How to create fast database queries

## Bayesian classification

From Stack Overflow:

> Suppose you’ve visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.
>
> I’m guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.
>
> Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.
>
> The more a given site’s demographics are different from this average (e.g. maybe it’s a site for parents, or for child-free people), the more it should count in my guess of your status.
>
> What’s the best way to calculate this?

This is a classical application of Bayes’ Theorem.

The formula to calculate the posterior probability is:

`P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))`

where:

• `P(A|B)` is the posterior probability of the visitor being male, given that the visitor has visited the site
• `P(A)` is the prior probability of the visitor being male (initially, 50%)
• `P(B)` is the probability of any Internet user visiting the site
• `P(B|A)` is the probability of a user visiting the site, given that the user is male
• `P(A*)` is the prior probability of the visitor not being male (initially, 50%)
• `P(B|A*)` is the probability of a user visiting the site, given that the user is not male.
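
To make the formula concrete, here is a minimal Python sketch of it. The function names `posterior` and `update_over_sites` are hypothetical and not from the original post. Chaining the update across several sites assumes that visits are independent given the visitor's gender (a naive Bayes assumption the text above does not state). The example also assumes that, with a 50/50 population, a site's male/female split can stand in for `P(B|A)` and `P(B|A*)` up to a common factor, which cancels in the formula.

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) given P(A), P(B|A) and P(B|A*)."""
    # P(B) = P(B|A) * P(A) + P(B|A*) * P(A*), where P(A*) = 1 - P(A)
    evidence = p_b_given_a * prior + p_b_given_not_a * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("P(B) is zero: the visit is impossible under both hypotheses")
    return p_b_given_a * prior / evidence


def update_over_sites(prior, likelihoods):
    """Apply the formula once per visited site, feeding each posterior
    back in as the next prior.

    ASSUMPTION: visits are conditionally independent given gender.
    `likelihoods` is a list of (P(B|A), P(B|A*)) pairs, one per site.
    """
    p = prior
    for p_b_given_a, p_b_given_not_a in likelihoods:
        p = posterior(p, p_b_given_a, p_b_given_not_a)
    return p


if __name__ == "__main__":
    # ASSUMPTION: a 52% male / 48% female audience is used directly as the
    # pair (P(B|A), P(B|A*)); only the ratio of the two values matters.
    ordinary_sites = [(0.52, 0.48)] * 50
    all_male_site = [(1.0, 0.0)]

    print(posterior(0.5, 0.52, 0.48))                         # one ordinary site
    print(update_over_sites(0.5, ordinary_sites))             # S1 … S50 only
    print(update_over_sites(0.5, ordinary_sites + all_male_site))  # plus S0
```

Under these assumptions a single ordinary site moves the estimate from 50% to about 52%, fifty of them push it to roughly 98%, and the one all-male site drives it to 100%, because `P(B|A*)` is zero there. This matches the asker's intuition that a site with extreme demographics should count far more than a straight average lets it.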

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL