EXPLAIN EXTENDED

How to create fast database queries

Archive for March, 2010

Bayesian classification

Comments enabled. I *really* need your comment

From Stack Overflow:

Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))

, where:

  • P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
  • P(A) is the prior probability of the visitor being a male (initially, 50%)
  • P(B) is the probability of (any Internet user) visiting the site
  • P(B|A) is the probability of a user visiting the site, given that he is a male
  • P(A*) is the prior probability of the visitor not being a male (initially, 50%)
  • P(B|A*) is the probability of a user visiting the site, given that she is not a male.

Read the rest of this entry »

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL

Greatest N per group: dealing with aggregates

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Vlad Enache asks:

In MySQL I have a table called meanings with three columns:

meanings
person word meaning
1 1 4
1 2 19
1 2 7
1 3 5

word has 16 possible values, meaning has 26.

A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.

There will be thousands of persons.

I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:

  • word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
  • word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)

etc.

Is it possible to solve this with a single MySQL query?

This task is a typical greatest-n-per-group problem.

Earlier I described some solutions to it, one using session variables, and another one using LIMIT in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs, and using the LIMIT solution would be inefficient, since the aggregates can not be indexed.

Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).

Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 18th, 2010 at 11:00 pm

Posted in MySQL

Aggregates and LEFT JOIN

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table product with products and table sale with all sale operations that were done on these products.

I would like to get 10 most often sold products today and what I did is this:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
sale s
ON      s.product_id = p.id
AND s.dt >= '2010-01-01'
AND s.dt < '2010-01-02'
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10

, but performance of it is very slow.

What can I do to increase performance of this particular query?

The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

March 5th, 2010 at 11:00 pm

Posted in MySQL

SQL Server: deleting with self-referential FOREIGN KEY

with 8 comments

From Stack Overflow:

I have an SQL Server table defined as below:

TestComposite
id (PK) siteUrl (PK) name parentId
1 site1 Item1 NULL
2 site1 Item2 NULL
3 site1 Folder1 NULL
4 site1 Folder1.Item1 3
5 site1 Folder1.Item2 3
6 site1 Folder1.Folder1 3
7 site1 Folder1.Folder1.Item1 6

Items and folders are stored inside the same table

If an item is inside a folder, the parentID column is the id of the folder.

I would like to be able to DELETE CASCADE items/folders when I delete a folder.

I tried to define a constraint similar to:

ALTER TABLE [TestComposite]
ADD CONSTRAINT fk_parentid
FOREIGN KEY (ParentID, SiteUrl)
REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE

, but it gives me this error:

Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

SQL Server does support chained CASCADE updates, but does not allow one table to participate more that once in a chain (i. e. does not allow loops).

SQL Server, unlike most other engines, optimizes cascading DML operations to be set-based which requires building a cycle-free DML order (which you can observe in the execution plan). With the loops, that would not be possible.

However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 3rd, 2010 at 11:00 pm

Posted in SQL Server

PostgreSQL: using recursive functions in nested sets

with one comment

In the previous article, I discussed a way to improve nested sets model in PostgreSQL.

The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.

This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)).

That solution was pure SQL and it was quite fast, but not optimal.

The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.

We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: nested sets and R-Tree

Comments enabled. I *really* need your comment

A feedback on one of my previous articles comparing adjacency list and nested sets models for PostgreSQL.

Jay writes:

In your series on adjacency lists vs nested sets, you discuss geometric types and R-Tree indexes in MySQL, but you don't discuss them when discussing the same subject with PostgreSQL, which also has geometric types and R-Tree indexing (mostly available through GiST indexes).

To make it simple I added the following line after the data insertion part of the script at Nested Sets - Postgresql:

ALTER TABLE t_hierarchy ADD COLUMN sets POLYGON;
UPDATE t_hierarchy SET sets = POLYGON(BOX(POINT(-1,lft), POINT(1, rgt)));

It needed to be a POLYGON instead of a BOX since there is a @>(POLYGON,POLYGON) function but no @>(BOX,BOX) function, and the polygon was cast from the box to create the rectangle shape required.

It outperforms the adjacency list on all descendants; outperforms it on all ancestors (not by much); performs reasonably well on all descendants up to a certain level on items with few descendants (e. g. 31415) and badly on items with many descendants (e. g. 42).

It still completes in less than 20 seconds though, which is an improvement over 1 minute.

PostgreSQL does support R-Tree indexes indeed (through GiST interface), and they can be used to improve the efficiency of the nested sets model.

Let's create a sample table and try some of the queries that Jay proposed:
Read the rest of this entry »

Written by Quassnoi

March 1st, 2010 at 11:00 pm

Posted in PostgreSQL