EXPLAIN EXTENDED

How to create fast database queries

Bayesian classification

Comments enabled. I *really* need your comment

From Stack Overflow:

Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))

, where:

  • P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
  • P(A) is the prior probability of the visitor being a male (initially, 50%)
  • P(B) is the probability of (any Internet user) visiting the site
  • P(B|A) is the probability of a user visiting the site, given that he is a male
  • P(A*) is the prior probability of the visitor not being a male (initially, 50%)
  • P(B|A*) is the probability of a user visiting the site, given that she is not a male.

Read the rest of this entry »

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL

Greatest N per group: dealing with aggregates

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Vlad Enache asks:

In MySQL I have a table called meanings with three columns:

meanings
person word meaning
1 1 4
1 2 19
1 2 7
1 3 5

word has 16 possible values, meaning has 26.

A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.

There will be thousands of persons.

I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:

  • word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
  • word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)

etc.

Is it possible to solve this with a single MySQL query?

This task is a typical greatest-n-per-group problem.

Earlier I described some solutions to it, one using session variables, and another one using LIMIT in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs, and using the LIMIT solution would be inefficient, since the aggregates can not be indexed.

Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).

Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 18th, 2010 at 11:00 pm

Posted in MySQL

Aggregates and LEFT JOIN

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table product with products and table sale with all sale operations that were done on these products.

I would like to get 10 most often sold products today and what I did is this:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
sale s
ON      s.product_id = p.id
AND s.dt >= '2010-01-01'
AND s.dt < '2010-01-02'
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10

, but performance of it is very slow.

What can I do to increase performance of this particular query?

The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

March 5th, 2010 at 11:00 pm

Posted in MySQL

SQL Server: deleting with self-referential FOREIGN KEY

with 8 comments

From Stack Overflow:

I have an SQL Server table defined as below:

TestComposite
id (PK) siteUrl (PK) name parentId
1 site1 Item1 NULL
2 site1 Item2 NULL
3 site1 Folder1 NULL
4 site1 Folder1.Item1 3
5 site1 Folder1.Item2 3
6 site1 Folder1.Folder1 3
7 site1 Folder1.Folder1.Item1 6

Items and folders are stored inside the same table

If an item is inside a folder, the parentID column is the id of the folder.

I would like to be able to DELETE CASCADE items/folders when I delete a folder.

I tried to define a constraint similar to:

ALTER TABLE [TestComposite]
ADD CONSTRAINT fk_parentid
FOREIGN KEY (ParentID, SiteUrl)
REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE

, but it gives me this error:

Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

SQL Server does support chained CASCADE updates, but does not allow one table to participate more that once in a chain (i. e. does not allow loops).

SQL Server, unlike most other engines, optimizes cascading DML operations to be set-based which requires building a cycle-free DML order (which you can observe in the execution plan). With the loops, that would not be possible.

However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 3rd, 2010 at 11:00 pm

Posted in SQL Server

PostgreSQL: using recursive functions in nested sets

with one comment

In the previous article, I discussed a way to improve nested sets model in PostgreSQL.

The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.

This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)).

That solution was pure SQL and it was quite fast, but not optimal.

The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.

We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 2nd, 2010 at 11:00 pm

Posted in PostgreSQL

PostgreSQL: nested sets and R-Tree

Comments enabled. I *really* need your comment

A feedback on one of my previous articles comparing adjacency list and nested sets models for PostgreSQL.

Jay writes:

In your series on adjacency lists vs nested sets, you discuss geometric types and R-Tree indexes in MySQL, but you don't discuss them when discussing the same subject with PostgreSQL, which also has geometric types and R-Tree indexing (mostly available through GiST indexes).

To make it simple I added the following line after the data insertion part of the script at Nested Sets - Postgresql:

ALTER TABLE t_hierarchy ADD COLUMN sets POLYGON;
UPDATE t_hierarchy SET sets = POLYGON(BOX(POINT(-1,lft), POINT(1, rgt)));

It needed to be a POLYGON instead of a BOX since there is a @>(POLYGON,POLYGON) function but no @>(BOX,BOX) function, and the polygon was cast from the box to create the rectangle shape required.

It outperforms the adjacency list on all descendants; outperforms it on all ancestors (not by much); performs reasonably well on all descendants up to a certain level on items with few descendants (e. g. 31415) and badly on items with many descendants (e. g. 42).

It still completes in less than 20 seconds though, which is an improvement over 1 minute.

PostgreSQL does support R-Tree indexes indeed (through GiST interface), and they can be used to improve the efficiency of the nested sets model.

Let's create a sample table and try some of the queries that Jay proposed:
Read the rest of this entry »

Written by Quassnoi

March 1st, 2010 at 11:00 pm

Posted in PostgreSQL

Six degrees of separation

with one comment

Answering questions asked on the site.

Kathy asks:

I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.

If your site grows popular, most probably, they are not. But we better check.

On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.

The friendship relationship is best stored in a many-to-many link table with a PRIMARY KEY on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a PRIMARY KEY won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

February 27th, 2010 at 11:00 pm

Posted in PostgreSQL

Efficient circle distance testing

Comments enabled. I *really* need your comment

Answering questions asked on the site.

eptil asks:

I am using SQL Server 2008, but not the spatial features.

I have a table with few entries, only 40,000. There is an id INT PRIMARY KEY column and two columns storing a 2d coordinate, both decimals.

I would like to find all the records that do not have other records within a given radius. The query I am using at the moment is:

SELECT id, x, y
FROM mytable t1
WHERE (
SELECT COUNT(*)
FROM mytable t2
WHERE ABS(t1.x - t2.x) < 25 AND ABS(t1.y - t2.y) < 25 ) = 1 [/sourcecode]
This is taking 15 minutes to run at times.

Is there a better way?

Of course using spatial abilities would be a better way, but it is possible to make do with plain SQL. This will also work in SQL Server 2005.

In most database engines, the spatial indexes are implemented as the R-Tree structures. SQL Server, however, uses another approach: surface tesselation.

Basically, it divides the surface into a finite number of tiles, each assigned with a unique number. The identifiers of tiles covered by the object are stored as keys of a plain B-Tree index.

When SQL Server's optimizer sees a geometrical predicate against an indexed column, it calculates the numbers of tiles that possibly can satisfy this predicate. Say, if the tiles are defined as squares with side 1, the predicate column.STDistance(@mypoint) < 2 can only be satisfied by the objects within 2 tiles away from @mypoint's tile. This gives a square of 25 tiles with @mypoint's tile in the center. The tile numbers can be found and searched for using the index. Exact filtering condition is then applied to each candidate value returned by the index.

Same solution can be used in our case even without the spatial functions. Comparing tile numbers is an equijoin and hash join method is eligible for this operation. We can even choose the tiling algorithm individually for each query, since we don't have to store the tile identifiers in the table, and the hash table will be built dynamically anyway.

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

February 26th, 2010 at 11:00 pm

Posted in SQL Server

Matching sets: aggregates vs. first miss

Comments enabled. I *really* need your comment

From Stack Overflow:

Here is my schema:

Suppliers
sid sname address
Parts
pid pname color
Catalog
sid pid cost

I need to find the sids of suppliers who supply every red part or every green part.

This task requires matching the sets.

We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.

Unlike other engines, MySQL does not provide the set operators like EXCEPT or MINUS that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.

There are two ways to check that:

  • First miss technique: test each record from the subset candidate against the superset candidate, returning FALSE if there is no match.
  • Aggregate technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match

Let's test which way is faster in which cases. To do this, we will need some sample tables:
Read the rest of this entry »

Written by Quassnoi

February 25th, 2010 at 11:00 pm

Posted in MySQL

Sargability of monotonic functions: example

with 2 comments

In my previous article I presented a proposal to add sargability of monotonic functions into the SQL engines.

In a nutshell: a monotonic function is a function that preserves the order of the argument so that it gives the larger results for the larger values of the argument. It is easy to prove that a B-tree with each key replaced by the result of the function will remain the valid B-Tree and hence can be used to search for ranges of function results just like it is used to search for ranges of values.

With a little effort, a B-Tree can also be used to search for the ranges of piecewise monotonic functions: those whose domain can be split into a number of continuous pieces with the function being monotonic within each piece (but it may be not monotonic and even not continuous across the pieces).

In this article, I'll demonstrate the algorithm to do that (implemented in pure SQL on PostgreSQL).

I will show how the performance of simple query

SELECT  *
FROM    t_sine
WHERE   SIN(value) BETWEEN 0.1234 AND 0.1235

could be improved if the sargability of monotonic functions had been implemented in the optimizer.
Read the rest of this entry »

Written by Quassnoi

February 23rd, 2010 at 11:00 pm

Posted in PostgreSQL