Author Archive
NoSQL
I had a vision tonight.
A huge, dark, grim figure approached me, seized me with its long bony arms and made me see all the vanity of the world we are living in.
Bloated database engines, useless ACID requirements, meaningless joins are now in the past for me.
I decided to move to NoSQL.
Where do I begin?
Bayesian classification
Comments enabled. I *really* need your comment
From Stack Overflow:
Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.
I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.
Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.
The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.
What's the best way to calculate this?
This is a classical application of Bayes' Theorem.
The formula to calculate the posterior probability is:
P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))
, where:
P(A|B)
is the posterior probability of the visitor being a male (given that he visited the site)P(A)
is the prior probability of the visitor being a male (initially, 50%)P(B)
is the probability of (any Internet user) visiting the siteP(B|A)
is the probability of a user visiting the site, given that he is a maleP(A*)
is the prior probability of the visitor not being a male (initially, 50%)P(B|A*)
is the probability of a user visiting the site, given that she is not a male.
Greatest N per group: dealing with aggregates
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Vlad Enache asks:
In MySQL I have a table called
meanings
with three columns:
meanings person word meaning 1 1 4 1 2 19 1 2 7 1 3 5
word
has 16 possible values,meaning
has 26.A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.
There will be thousands of persons.
I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:
- word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
- word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)
etc.
Is it possible to solve this with a single MySQL query?
This task is a typical greatest-n-per-group problem.
Earlier I described some solutions to it, one using session variables, and another one using LIMIT
in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs
, and using the LIMIT
solution would be inefficient, since the aggregates can not be indexed.
Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).
Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT
.
Let's create a sample table:
Read the rest of this entry »
Aggregates and LEFT JOIN
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table
product
with products and tablesale
with all sale operations that were done on these products.I would like to get 10 most often sold products today and what I did is this:
SELECT p.*, COUNT(s.id) AS sumsell FROM product p LEFT JOIN sale s ON s.product_id = p.id AND s.dt >= '2010-01-01' AND s.dt < '2010-01-02' GROUP BY p.id ORDER BY sumsell DESC LIMIT 10, but performance of it is very slow.
What can I do to increase performance of this particular query?
The query involves a LEFT JOIN
which in MySQL world means that products
will be made leading in the query. Each record of product
will be taken and checked against sale
table to find out the number of matching records. If no matching records are found, 0 is returned.
Let's create the sample tables:
Read the rest of this entry »
SQL Server: deleting with self-referential FOREIGN KEY
From Stack Overflow:
I have an SQL Server table defined as below:
TestComposite id (PK) siteUrl (PK) name parentId 1 site1 Item1 NULL 2 site1 Item2 NULL 3 site1 Folder1 NULL 4 site1 Folder1.Item1 3 5 site1 Folder1.Item2 3 6 site1 Folder1.Folder1 3 7 site1 Folder1.Folder1.Item1 6 Items and folders are stored inside the same table
If an item is inside a folder, the
parentID
column is theid
of the folder.I would like to be able to
DELETE CASCADE
items/folders when I delete a folder.I tried to define a constraint similar to:
ALTER TABLE [TestComposite] ADD CONSTRAINT fk_parentid FOREIGN KEY (ParentID, SiteUrl) REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE, but it gives me this error:
Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
SQL Server does support chained CASCADE
updates, but does not allow one table to participate more that once in a chain (i. e. does not allow loops).
SQL Server, unlike most other engines, optimizes cascading DML operations to be set-based which requires building a cycle-free DML order (which you can observe in the execution plan). With the loops, that would not be possible.
However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.
Let's create a sample table:
Read the rest of this entry »
PostgreSQL: using recursive functions in nested sets
In the previous article, I discussed a way to improve nested sets model in PostgreSQL.
The approach shown in the article used an analytical function to filter all immediate children of a node in a recursive CTE.
This allowed us to filter a node's children on the level more efficiently than R-Tree or B-Tree approaches do (since they rely on COUNT(*)
).
That solution was pure SQL and it was quite fast, but not optimal.
The drawback of that solution is that it still needs to fetch all children of a node to apply the analytic function to them. This can take much time for the top of the hierarchy. And since the top of the hierarchy is what is what usually shown at the start page, it would be very nice to improve this query yet a little more.
We can do it by creating and using a simple recursive SQL function. This function does not even require PL/pgSQL to be enabled.
Let's create a sample table:
Read the rest of this entry »
PostgreSQL: nested sets and R-Tree
Comments enabled. I *really* need your comment
A feedback on one of my previous articles comparing adjacency list and nested sets models for PostgreSQL.
Jay writes:
In your series on adjacency lists vs nested sets, you discuss geometric types and R-Tree indexes in MySQL, but you don't discuss them when discussing the same subject with PostgreSQL, which also has geometric types and R-Tree indexing (mostly available through GiST indexes).
To make it simple I added the following line after the data insertion part of the script at Nested Sets - Postgresql:
ALTER TABLE t_hierarchy ADD COLUMN sets POLYGON; UPDATE t_hierarchy SET sets = POLYGON(BOX(POINT(-1,lft), POINT(1, rgt)));It needed to be a
POLYGON
instead of aBOX
since there is a@>(POLYGON,POLYGON)
function but no@>(BOX,BOX)
function, and the polygon was cast from the box to create the rectangle shape required.It outperforms the adjacency list on
all descendants; outperforms it onall ancestors(not by much); performs reasonably well onall descendants up to a certain levelon items with few descendants (e. g. 31415) and badly on items with many descendants (e. g. 42).It still completes in less than 20 seconds though, which is an improvement over 1 minute.
PostgreSQL does support R-Tree indexes indeed (through GiST interface), and they can be used to improve the efficiency of the nested sets model.
Let's create a sample table and try some of the queries that Jay proposed:
Read the rest of this entry »
Six degrees of separation
Answering questions asked on the site.
Kathy asks:
I am developing a social network site in PostgreSQL and want to find out if two people are no more than 6 friends apart.
If your site grows popular, most probably, they are not. But we better check.
On most social networks, friendship is a symmetric relationship (however, LiveJournal is a notable exception). This means that if Alice is a friend to Bob, then Bob is a friend to Alice as well.
The friendship relationship is best stored in a many-to-many link table with a PRIMARY KEY
on both link fields and an additional check condition: the friend with the least id should be stored in the first column. This is to avoid storing a relationship twice: a PRIMARY KEY
won't be violated if the same record with the columns swapped will be inserted, but the check constraint will. The check constraint will also forbid storing a friend relationship to itself.
Let's create a sample table:
Read the rest of this entry »
Efficient circle distance testing
Comments enabled. I *really* need your comment
Answering questions asked on the site.
eptil asks:
I am using SQL Server 2008, but not the spatial features.
I have a table with few entries, only 40,000. There is an
id INT PRIMARY KEY
column and two columns storing a 2d coordinate, both decimals.I would like to find all the records that do not have other records within a given radius. The query I am using at the moment is:
SELECT id, x, y
FROM mytable t1
WHERE (
SELECT COUNT(*)
FROM mytable t2
WHERE ABS(t1.x - t2.x) < 25 AND ABS(t1.y - t2.y) < 25 ) = 1 [/sourcecode]
This is taking 15 minutes to run at times.Is there a better way?
Of course using spatial abilities would be a better way, but it is possible to make do with plain SQL. This will also work in SQL Server 2005.
In most database engines, the spatial indexes are implemented as the R-Tree structures. SQL Server, however, uses another approach: surface tesselation.
Basically, it divides the surface into a finite number of tiles, each assigned with a unique number. The identifiers of tiles covered by the object are stored as keys of a plain B-Tree index.
When SQL Server's optimizer sees a geometrical predicate against an indexed column, it calculates the numbers of tiles that possibly can satisfy this predicate. Say, if the tiles are defined as squares with side 1, the predicate column.STDistance(@mypoint) < 2
can only be satisfied by the objects within 2 tiles away from @mypoint
's tile. This gives a square of 25 tiles with @mypoint
's tile in the center. The tile numbers can be found and searched for using the index. Exact filtering condition is then applied to each candidate value returned by the index.
Same solution can be used in our case even without the spatial functions. Comparing tile numbers is an equijoin and hash join method is eligible for this operation. We can even choose the tiling algorithm individually for each query, since we don't have to store the tile identifiers in the table, and the hash table will be built dynamically anyway.
Let's create a sample table and see how it works:
Read the rest of this entry »
Matching sets: aggregates vs. first miss
Comments enabled. I *really* need your comment
From Stack Overflow:
Here is my schema:
Suppliers sid sname address
Parts pid pname color
Catalog sid pid cost I need to find the sids of suppliers who supply every red part or every green part.
This task requires matching the sets.
We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.
Unlike other engines, MySQL does not provide the set operators like EXCEPT
or MINUS
that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.
There are two ways to check that:
First miss
technique: test each record from the subset candidate against the superset candidate, returningFALSE
if there is no match.Aggregate
technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match
Let's test which way is faster in which cases. To do this, we will need some sample tables:
Read the rest of this entry »