Archive for March, 2010
Bayesian classification
Comments enabled. I *really* need your comment
From Stack Overflow:
Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.
I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.
Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.
The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.
What's the best way to calculate this?
This is a classical application of Bayes' Theorem.
The formula to calculate the posterior probability is:
P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))
, where:
P(A|B)
is the posterior probability of the visitor being a male (given that he visited the site)P(A)
is the prior probability of the visitor being a male (initially, 50%)P(B)
is the probability of (any Internet user) visiting the siteP(B|A)
is the probability of a user visiting the site, given that he is a maleP(A*)
is the prior probability of the visitor not being a male (initially, 50%)P(B|A*)
is the probability of a user visiting the site, given that she is not a male.
Greatest N per group: dealing with aggregates
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Vlad Enache asks:
In MySQL I have a table called
meanings
with three columns:
meanings person word meaning 1 1 4 1 2 19 1 2 7 1 3 5
word
has 16 possible values,meaning
has 26.A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.
There will be thousands of persons.
I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:
- word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
- word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)
etc.
Is it possible to solve this with a single MySQL query?
This task is a typical greatest-n-per-group problem.
Earlier I described some solutions to it, one using session variables, and another one using LIMIT
in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs
, and using the LIMIT
solution would be inefficient, since the aggregates can not be indexed.
Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).
Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT
.
Let's create a sample table:
Read the rest of this entry »
Aggregates and LEFT JOIN
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table
product
with products and tablesale
with all sale operations that were done on these products.I would like to get 10 most often sold products today and what I did is this:
SELECT p.*, COUNT(s.id) AS sumsell FROM product p LEFT JOIN sale s ON s.product_id = p.id AND s.dt >= '2010-01-01' AND s.dt < '2010-01-02' GROUP BY p.id ORDER BY sumsell DESC LIMIT 10, but performance of it is very slow.
What can I do to increase performance of this particular query?
The query involves a LEFT JOIN
which in MySQL world means that products
will be made leading in the query. Each record of product
will be taken and checked against sale
table to find out the number of matching records. If no matching records are found, 0 is returned.
Let's create the sample tables:
Read the rest of this entry »
SQL Server: deleting with self-referential FOREIGN KEY
From Stack Overflow:
I have an SQL Server table defined as below:
TestComposite id (PK) siteUrl (PK) name parentId 1 site1 Item1 NULL 2 site1 Item2 NULL 3 site1 Folder1 NULL 4 site1 Folder1.Item1 3 5 site1 Folder1.Item2 3 6 site1 Folder1.Folder1 3 7 site1 Folder1.Folder1.Item1 6 Items and folders are stored inside the same table
If an item is inside a folder, the
parentID
column is theid
of the folder.I would like to be able to
DELETE CASCADE
items/folders when I delete a folder.I tried to define a constraint similar to:
ALTER TABLE [TestComposite] ADD CONSTRAINT fk_parentid FOREIGN KEY (ParentID, SiteUrl) REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE, but it gives me this error:
Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
SQL Server does support chained CASCADE
updates, but does not allow one table to participate more that once in a chain (i. e. does not allow loops).
SQL Server, unlike most other engines, optimizes cascading DML operations to be set-based which requires building a cycle-free DML order (which you can observe in the execution plan). With the loops, that would not be possible.
However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.
Let's create a sample table:
Read the rest of this entry »
PostgreSQL: using recursive functions in nested sets
In the previous article, I discussed a way to improve nested sets model in PostgreSQL.
The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.
This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)
).
That solution was pure SQL and it was quite fast, but not optimal.
The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.
We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.
Let's create a sample table:
Read the rest of this entry »
PostgreSQL: nested sets and R-Tree
Comments enabled. I *really* need your comment
A feedback on one of my previous articles comparing adjacency list and nested sets models for PostgreSQL.
Jay writes:
In your series on adjacency lists vs nested sets, you discuss geometric types and R-Tree indexes in MySQL, but you don't discuss them when discussing the same subject with PostgreSQL, which also has geometric types and R-Tree indexing (mostly available through GiST indexes).
To make it simple I added the following line after the data insertion part of the script at Nested Sets - Postgresql:
ALTER TABLE t_hierarchy ADD COLUMN sets POLYGON; UPDATE t_hierarchy SET sets = POLYGON(BOX(POINT(-1,lft), POINT(1, rgt)));It needed to be a
POLYGON
instead of aBOX
since there is a@>(POLYGON,POLYGON)
function but no@>(BOX,BOX)
function, and the polygon was cast from the box to create the rectangle shape required.It outperforms the adjacency list on
all descendants; outperforms it onall ancestors(not by much); performs reasonably well onall descendants up to a certain levelon items with few descendants (e. g. 31415) and badly on items with many descendants (e. g. 42).It still completes in less than 20 seconds though, which is an improvement over 1 minute.
PostgreSQL does support R-Tree indexes indeed (through GiST interface), and they can be used to improve the efficiency of the nested sets model.
Let's create a sample table and try some of the queries that Jay proposed:
Read the rest of this entry »