EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Things SQL needs: MERGE JOIN that would seek

with 6 comments

One of the most known and least used join algorithms in SQL engines is MERGE JOIN.

This algorithm operates on two sorted recordsets, keeping two pointers that chase each other.

The Wikipedia entry above describes it quite well in terms of algorithms. I’ll just make an animated GIF to make it more clear:

This is quite a nice and elegant algorithm, which, unfortunately, has two major drawbacks:

  1. It needs the recordsets to be sorted
  2. Even with the recordsets sorted, it is no better than a HASH JOIN

The sorting part is essential for this algorithm and there is nothing that can be done with it: the recordsets should be sorted, period. Databases, however, often provide the records in the sorted order: from clustered tables, indexes, previously sorted and ordered subqueries, spool tables etc.

But even when the recordsets are already sorted, on equijoins the MERGE JOIN is hardly faster than a HASH JOIN.

Why?
Read the rest of this entry »

Written by Quassnoi

May 7th, 2010 at 11:00 pm

Posted in Miscellaneous

Groups holding highest ranked items

Answering questions asked on the site.

Nate asks:

I know you’ve addressed similar issues related to the greatest-per-group query but this seems to be a different take on that.

Example table:

t_group
item_id group_id score
100 1 2
100 2 3
200 1 1
300 1 4
300 2 2

Each item may be in multiple groups. Each instance of an item in that group is given a score (how relevant it is the the group).

So given the data above, when querying for group 1 it should return items 200 and 300 (item 100‘s highest score is for group 2, so it’s excluded).

The classical greatest-n-per-group problem requires selecting a single record from each group holding a group-wise maximum. This case is a little bit different: for a given group, we need to select all records holding an item-wise maximum.

Let’s create a sample table:
Read the rest of this entry »

Written by Quassnoi

April 22nd, 2010 at 11:00 pm

Posted in MySQL

Hierarchical query in MySQL: limiting parents

Answering questions asked on the site.

James asks:

Your series on hierarchical queries in MySQL is tremendous! I’m using it to create a series of threaded conversations.

I’m wondering if there is a way to paginate these results.

Specifically, let’s say I want to limit the conversations to return 10 root nodes (parent=0) and all of their children in a query.

I can’t just limit the final query, because that will clip off children. I’ve tried to add LIMITs to your stored functions, but I’m not getting the magic just right.

How would you go about doing this?

A quick reminder: MySQL does not support recursion (either CONNECT BY style or recursive CTE style), so using an adjacency list model is a somewhat complicated task.

However, it is still possible. The main idea is storing the recursion state in a session variable and call a user-defined function repeatedly to iterate over the tree, thus emulating recursion. The article mentioned in the question shows how to do that.

Normally, reading and assigning session variables in the same query is discouraged in MySQL, since the order of evaluation is not guaranteed. However, in the case we only use the table as a dummy recordset and no values of the records are actually used in the function, so the actual values returned by the function are completely defined by the function itself. The table is only used to ensure that the function is called enough times, and to present its results in form of a native resultset (which can be returned or joined with).

To do something with the logic of the function (like, imposing a limit on the parent nodes without doing the same on the child nodes), we, therefore, should tweak the function code, not the query that calls the functions. The only thing that matters in such a query is the number of records returned and we don’t know it in design time.

Limiting the parent nodes is quite simple: we just use another session variable to track the number of parent branches yet to be returned and stop processing as soon as the limit is hit, that is the variable becomes zero.

Let’s create a sample table and see how to do this:
Read the rest of this entry »

Written by Quassnoi

April 18th, 2010 at 11:00 pm

Posted in MySQL

Date ranges: overlapping with priority

Answering questions asked on the site.

Jason Foster asks:

We have a table of student registrations:

Students
student_code course_code course_section session_cd
987654321 ESC102H1 Y 20085
998766543 ELEE203H F 20085

course_code and course_section identify a course, session_cd is an academic session, e. g. 20085, 20091, 20079.

The courses (stored in another table) have associated values for engineering design, complementary studies, etc., like that:

Courses
course_code course_section start_session end_session design science studies
ESC102H1 F 20071 20099 10 0 0
AER201Y1 Y 20059 NULL 0 0 30

, or like that:

In-house courses
course_code course_section student_code design science studies
ESC102H1 F 998766543 10 0 0

We are required by an external accreditation body to add up all of the units of engineering design, complementary studies, etc., taken by an individual student.

Where it gets really messy is that we have multiple data feeds for the associated values of courses. For example we have a set from the Registrar’s Office, the Civil Department, our In-House version, etc.

The rule is that In-House beats Civil beats the Registrar’s Office in the case of any duplication within the overlapping intervals.

The session_cd is of the form YYYY{1,5,9}.

Basically, we have three sets here.

To get the course hours for a given student we should find a record for him in the in-house set, or, failing that, find if the session is within the ranges of one of the external sets (Civil or Registrar). If both ranges contain the academic session the student took the course, Civil should be taken.

The first part is quite simple: we just LEFT JOIN students with the in-house courses and get the hours for the courses which are filled. The real problem is the next part: searching for the ranges containing a given value.

As I already mentioned in the previous posts, relational databases are not in general that efficient for the queries like that. It’s easy to use an index to find a value of a column within a given range, but B-Tree indexes are of little help in searching for a range of two columns containing a given value.

However, in this case, the data domain of session_cd is quite a limited set. For each pair of session_start and session_end it is easy to create a set of all possible values between session_start and session_end.

The overlapping parts of the session ranges from the two sets will yields two records for each of the sessions belonging to the range. Of these two records we will need to take the relevant one (that is Civil) by using DISTINCT ON with the additional sorting on the source (Civil goes first).

Then we just join the relevant records to the subset of the students which does not have corresponding records in the in-house version.

Finally, we need to union this with the in-house recordset.
Read the rest of this entry »

Written by Quassnoi

April 7th, 2010 at 11:00 pm

Posted in PostgreSQL

Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS

Answering questions asked on the site.

Andrew Stillard asks:

I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.

Users search for products based on the options via an Advanced Search menu.

The users need to be able to select multiple options upon which to query. I would normally use a JOIN if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through the WHERE EXISTS statement.

The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.

The option table mentioned here is in fact an implementation of the EAV model in a relational database.

Each record basically contains 3 things: id of the product it describes; id of the option and the value of the given option for the given product. These fields represent entity, attribute and value, respectively.

This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.

However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.

There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:

  1. For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using COUNT(*) as a filter. The number of entity entries should be equal to the number of the attributes.
  2. Takes each entity and for each attribute, check if the condition holds.

The first approach uses a GROUP BY, the second one uses EXISTS.

Let’s create a sample table and see which one is more efficient:
Read the rest of this entry »

Written by Quassnoi

April 2nd, 2010 at 11:00 pm

Posted in MySQL

NoSQL

with 6 comments

I had a vision tonight.

Image by nataliej

A huge, dark, grim figure approached me, seized me with its long bony arms and made me see all the vanity of the world we are living in.

Bloated database engines, useless ACID requirements, meaningless joins are now in the past for me.

I decided to move to NoSQL.

Where do I begin?

Written by Quassnoi

April 1st, 2010 at 1:00 pm

Posted in Miscellaneous

Bayesian classification

From Stack Overflow:

Suppose you’ve visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I’m guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site’s demographics are different from this average (e.g. maybe it’s a site for parents, or for child-free people), the more it should count in my guess of your status.

What’s the best way to calculate this?

This is a classical application of Bayes’ Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))

, where:

  • P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
  • P(A) is the prior probability of the visitor being a male (initially, 50%)
  • P(B) is the probability of (any Internet user) visiting the site
  • P(B|A) is the probability of a user visiting the site, given that he is a male
  • P(A*) is the prior probability of the visitor not being a male (initially, 50%)
  • P(B|A*) is the probability of a user visiting the site, given that she is not a male.

Read the rest of this entry »

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL

Greatest N per group: dealing with aggregates

Answering questions asked on the site.

Vlad Enache asks:

In MySQL I have a table called meanings with three columns:

meanings
person word meaning
1 1 4
1 2 19
1 2 7
1 3 5

word has 16 possible values, meaning has 26.

A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.

There will be thousands of persons.

I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:

  • word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
  • word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)

etc.

Is it possible to solve this with a single MySQL query?

This task is a typical greatest-n-per-group problem.

Earlier I described some solutions to it, one using session variables, and another one using LIMIT in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs, and using the LIMIT solution would be inefficient, since the aggregates can not be indexed.

Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).

Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT.

Let’s create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 18th, 2010 at 11:00 pm

Posted in MySQL

Aggregates and LEFT JOIN

From Stack Overflow:

I have a table product with products and table sale with all sale operations that were done on these products.

I would like to get 10 most often sold products today and what I did is this:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt >= '2010-01-01'
        AND s.dt < '2010-01-02'
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10

, but performance of it is very slow.

What can I do to increase performance of this particular query?

The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.

Let’s create the sample tables:
Read the rest of this entry »

Written by Quassnoi

March 5th, 2010 at 11:00 pm

Posted in MySQL

SQL Server: deleting with self-referential FOREIGN KEY

From Stack Overflow:

I have an SQL Server table defined as below:

TestComposite
id (PK) siteUrl (PK) name parentId
1 site1 Item1 NULL
2 site1 Item2 NULL
3 site1 Folder1 NULL
4 site1 Folder1.Item1 3
5 site1 Folder1.Item2 3
6 site1 Folder1.Folder1 3
7 site1 Folder1.Folder1.Item1 6

Items and folders are stored inside the same table

If an item is inside a folder, the parentID column is the id of the folder.

I would like to be able to DELETE CASCADE items/folders when I delete a folder.

I tried to define a constraint similar to:

ALTER TABLE [TestComposite]
ADD CONSTRAINT fk_parentid
FOREIGN KEY (ParentID, SiteUrl)
REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE

, but it gives me this error:

Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

SQL Server does support chained CASCADE updates, but does not allow one table to participate more that once in a chain (i. e. does not allow loops).

SQL Server, unlike most other engines, optimizes cascading DML operations to be set-based which requires building a cycle-free DML order (which you can observe in the execution plan). With the loops, that would not be possible.

However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.

Let’s create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 3rd, 2010 at 11:00 pm

Posted in SQL Server