Bayesian classification at EXPLAIN EXTENDED

Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A^*) × P(A^*))

, where:

P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
P(A) is the prior probability of the visitor being a male (initially, 50%)
P(B) is the probability of (any Internet user) visiting the site
P(B|A) is the probability of a user visiting the site, given that he is a male
P(A^*) is the prior probability of the visitor not being a male (initially, 50%)
P(B|A^*) is the probability of a user visiting the site, given that she is not a male.

Since a user can only be male or female:

P(A|B) = P(B|A)×P(A)/P(B) = P(B|A)×P(A) / (P(B|A)×P(A) + (1 - P(B|A))×(1 - P(A)))

P(B|A) is the number stored in the database (probability of the user being a male).

We consider the events of visiting the different sites to be independent (a fact that the user visited site A neither influences nor is influenced by the fact that the user also visited site B. This is of course not so, since the sites may exchange links etc., but we make this assumption for the sake of simplicity.

So, given a series of the sites, we take the initial probability (P₀ = 0.5) and recursively substitute it into the following formula:

P_n = S_n×P_n-1 / (S_n×P_n-1 + (1 - S_n)×(1 - P_n-1))

Simple calculations show us the the recursion (which MySQL is not good at) can be replaced with an aggregate formula:

P = P0 * PROD(S) / (P0 * PROD(S) + (1 - P0) * PROD(1 - S))

, where PROD is the aggregate product of the sites' probabilities of their visitors being a male.

SQL does not have a built-in aggregate product function, but it can be easily replaced by the aggregate sum on the logarithmic scale:

P = P0 * EXP(SUM(LN(S))) / (P0 * EXP(SUM(LN(S))) + (1 - P0) * EXP(SUM(LN(1 - S))))

Given all that, let's create some sample tables:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_user (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20) NOT NULL,
gender CHAR(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_site (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20) NOT NULL,
male DOUBLE NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_visit (
u_id INT NOT NULL,
s_id INT NOT NULL,
PRIMARY KEY (u_id, s_id)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000);
COMMIT;

INSERT
INTO    t_user
SELECT  id, CONCAT('User ', id),
        CASE WHEN RAND(20100325) &gt; 0.5 THEN 'M' ELSE 'F' END
FROM    filler;

INSERT
INTO    t_site
SELECT  id, CONCAT('Site ', id),
RAND(20100325 &lt;&lt; 1) * 0.94 + 0.03
FROM    filler;

INSERT
INTO    t_visit
SELECT  u_id, s_id
FROM    (
        SELECT  u.id AS u_id, s.id AS s_id, u.gender, s.male,
                RAND(20100325 &lt;&lt; 2) AS rnds,
                RAND(20100325 &lt;&lt; 3) AS rndm
        FROM    t_user u
        CROSS JOIN
                t_site s
        ) q
WHERE   rnds &lt; 0.05
        AND rndm &lt; CASE gender WHEN 'M' THEN male ELSE 1 - male END;

There are 1,000 users and 1,000 sites. The sites are assigned with maleness from 0.03 to 0.97.

User randomly visit the sites according to their gender and the site gender distribution. There are 25 visits per user in average.

Let's try to guess the users' gender and return only wrong guesses.

We will assume that the user is male when the posterior probability of the user being male is more than 0.99, female if that is less than 0.01, undefined if within 0.01 and 0.99:

SELECT  *, CASE WHEN posterior &lt; 0.01 THEN 'F' WHEN posterior &gt; 0.99 THEN 'M' ELSE 'U' END AS guessed
FROM    (
SELECT  u.*,
prior * EXP(SUM(LN(male))) / (prior * EXP(SUM(LN(male))) + (1 - prior) * EXP(SUM(LN(1 - male)))) AS posterior
FROM    (
SELECT  0.5 AS prior
) vars
CROSS JOIN
t_user u
LEFT JOIN
t_visit v
ON      v.u_id = u.id
LEFT JOIN
t_site s
ON      s.id = v.s_id
GROUP BY
u.id
) q
HAVING  guessed &lt;&gt; gender

id	name	gender	posterior	guessed
51	User 51	F	0.652234131074669	U
53	User 53	F	0.87625067361204	U
94	User 94	M	0.732238662361337	U
264	User 264	F	0.0520209347475727	U
475	User 475	M	0.974230285094509	U
497	User 497	M	0.966568719694869	U
542	User 542	F	0.0685609699288645	U
595	User 595	M	0.984478426560255	U
742	User 742	F	0.0334681988009631	U
768	User 768	M	0.960799229888108	U
800	User 800	F	0.0181411777994256	U
867	User 867	F	0.0401728770664721	U
882	User 882	M	0.884671868426923	U
902	User 902	F	0.802525467489821	U
14 rows fetched in 0.0006s (0.1989s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					1000	100.00
2	DERIVED	<derived3>	system					1	100.00	Using temporary; Using filesort
2	DERIVED	u	ALL					871	100.00
2	DERIVED	v	ref	PRIMARY	PRIMARY	4	20100325_bayes.u.id	12	100.00	Using index
2	DERIVED	s	eq_ref	PRIMARY	PRIMARY	4	20100325_bayes.v.s_id	1	100.00
3	DERIVED									No tables used

select `q`.`id` AS `id`,`q`.`name` AS `name`,`q`.`gender` AS `gender`,`q`.`posterior` AS `posterior`,(case when (`q`.`posterior` < 0.01) then 'F' when (`q`.`posterior` > 0.99) then 'M' else 'U' end) AS `guessed` from (select `20100325_bayes`.`u`.`id` AS `id`,`20100325_bayes`.`u`.`name` AS `name`,`20100325_bayes`.`u`.`gender` AS `gender`,(('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) / (('0.5' * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) + ((1 - '0.5') * exp(sum(ln((1 - `20100325_bayes`.`s`.`male`))))))) AS `posterior` from (select 0.5 AS `prior`) `vars` join `20100325_bayes`.`t_user` `u` left join `20100325_bayes`.`t_visit` `v` on((`20100325_bayes`.`v`.`u_id` = `20100325_bayes`.`u`.`id`)) left join `20100325_bayes`.`t_site` `s` on((`20100325_bayes`.`s`.`id` = `20100325_bayes`.`v`.`s_id`)) where 1 group by `20100325_bayes`.`u`.`id`) `q` having (convert(`guessed` using utf8) <> `q`.`gender`)

From 1,000 users, we only have 14 results outside the credible interval of 99% and all of these are undefined rather than false.

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED

Bayesian classification

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow