EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘MySQL’ Category

LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns

with one comment

In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.

Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.

In SQL, there are at least three methods to implement it:

LEFT JOIN / IS NULL

SELECT  o.*
FROM    outer o
LEFT JOIN
inner i
ON      i.value = o.value
WHERE   i.value IS NULL

NOT IN

SELECT  o.*
FROM    outer o
WHERE   o.value NOT IN
(
SELECT  value
FROM    inner
)

NOT EXISTS

SELECT  o.*
FROM    outer o
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    inner i
WHERE   i.value = o.value
)

When inner.value is marked as NOT NULL, all these queries are semantically equivalent and with proper indexing have similarly optimized execution plans in MySQL.

Now, what if inner.value is not nullable and does contain some NULL values?

Let's create some sample tables:
Read the rest of this entry »

Written by Quassnoi

May 27th, 2010 at 11:00 pm

Posted in MySQL

MAX and MIN on a composite index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Ivo Radev asks:

I am trying to make a very simple query.

We have a log table which different machines write to. Given the machine list, I need to find the latest log timestamp.

Currently, the query looks like this:

SELECT  MAX(log_time)
FROM    log_table
WHERE   log_machine IN ($machines)

, and I pass the comma-separated list of $machines from PHP.

The weird thing is that the query is literally instant when there is only one machine (any) in the list but slow when there are multiple machines.

I'm considering doing it in separate queries and then process the results in PHP. However I'd like to know if there is a fast solution in MySQL.

Most probably, there is a composite index on (log_machine, log_time) which is being used for the query.

Usually, a query like this:

SELECT  MAX(log_time)
FROM    log_table

on the indexed field log_time can be served with a single index seek on the index.

Indeed, the MAX(log_time), by definition, is the latest entry in the index order, and can be fetched merely by finding the trailing index entry. It's a matter of several page reads in the B-Tree, each one following the rightmost link to the lower-level page.

Similarly, this query:

SELECT  MAX(log_time)
FROM    log_table
WHERE   log_machine = $my_machine

can be served with a single index seek too. However, the index should include log_machine as a leading column.

In this case, a set of records satisfying the WHERE clause of the query is represented by a single logically continuous block of records in the index, each one sharing the same value of log_machine. MAX(log_time) will of course be held by the last record in this block. MySQL just finds that last record and takes the log_time out of it.

Now, what if we have a multiple condition on log_machine?
Read the rest of this entry »

Written by Quassnoi

May 8th, 2010 at 11:00 pm

Posted in MySQL

Groups holding highest ranked items

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Nate asks:

I know you've addressed similar issues related to the greatest-per-group query but this seems to be a different take on that.

Example table:

t_group
item_id group_id score
100 1 2
100 2 3
200 1 1
300 1 4
300 2 2

Each item may be in multiple groups. Each instance of an item in that group is given a score (how relevant it is the the group).

So given the data above, when querying for group 1 it should return items 200 and 300 (item 100's highest score is for group 2, so it's excluded).

The classical greatest-n-per-group problem requires selecting a single record from each group holding a group-wise maximum. This case is a little bit different: for a given group, we need to select all records holding an item-wise maximum.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

April 22nd, 2010 at 11:00 pm

Posted in MySQL

Hierarchical query in MySQL: limiting parents

with 6 comments

Answering questions asked on the site.

James asks:

Your series on hierarchical queries in MySQL is tremendous! I'm using it to create a series of threaded conversations.

I'm wondering if there is a way to paginate these results.

Specifically, let's say I want to limit the conversations to return 10 root nodes (parent=0) and all of their children in a query.

I can't just limit the final query, because that will clip off children. I've tried to add LIMITs to your stored functions, but I'm not getting the magic just right.

How would you go about doing this?

A quick reminder: MySQL does not support recursion (either CONNECT BY style or recursive CTE style), so using an adjacency list model is a somewhat complicated task.

However, it is still possible. The main idea is storing the recursion state in a session variable and call a user-defined function repeatedly to iterate over the tree, thus emulating recursion. The article mentioned in the question shows how to do that.

Normally, reading and assigning session variables in the same query is discouraged in MySQL, since the order of evaluation is not guaranteed. However, in the case we only use the table as a dummy recordset and no values of the records are actually used in the function, so the actual values returned by the function are completely defined by the function itself. The table is only used to ensure that the function is called enough times, and to present its results in form of a native resultset (which can be returned or joined with).

To do something with the logic of the function (like, imposing a limit on the parent nodes without doing the same on the child nodes), we, therefore, should tweak the function code, not the query that calls the functions. The only thing that matters in such a query is the number of records returned and we don't know it in design time.

Limiting the parent nodes is quite simple: we just use another session variable to track the number of parent branches yet to be returned and stop processing as soon as the limit is hit, that is the variable becomes zero.

Let's create a sample table and see how to do this:
Read the rest of this entry »

Written by Quassnoi

April 18th, 2010 at 11:00 pm

Posted in MySQL

Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS

with 5 comments

Answering questions asked on the site.

Andrew Stillard asks:

I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.

Users search for products based on the options via an Advanced Search menu.

The users need to be able to select multiple options upon which to query. I would normally use a JOIN if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through the WHERE EXISTS statement.

The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.

The option table mentioned here is in fact an implementation of the EAV model in a relational database.

Each record basically contains 3 things: id of the product it describes; id of the option and the value of the given option for the given product. These fields represent entity, attribute and value, respectively.

This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.

However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.

There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:

  1. For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using COUNT(*) as a filter. The number of entity entries should be equal to the number of the attributes.
  2. Takes each entity and for each attribute, check if the condition holds.

The first approach uses a GROUP BY, the second one uses EXISTS.

Let's create a sample table and see which one is more efficient:
Read the rest of this entry »

Written by Quassnoi

April 2nd, 2010 at 11:00 pm

Posted in MySQL

Bayesian classification

Comments enabled. I *really* need your comment

From Stack Overflow:

Suppose you've visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male.

I'm guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%.

The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

This is a classical application of Bayes' Theorem.

The formula to calculate the posterior probability is:

P(A|B) = P(B|A) × P(A) / P(B) = P(B|A) × P(A) / (P(B|A) × P(A) + P(B|A*) × P(A*))

, where:

  • P(A|B) is the posterior probability of the visitor being a male (given that he visited the site)
  • P(A) is the prior probability of the visitor being a male (initially, 50%)
  • P(B) is the probability of (any Internet user) visiting the site
  • P(B|A) is the probability of a user visiting the site, given that he is a male
  • P(A*) is the prior probability of the visitor not being a male (initially, 50%)
  • P(B|A*) is the probability of a user visiting the site, given that she is not a male.

Read the rest of this entry »

Written by Quassnoi

March 25th, 2010 at 11:00 pm

Posted in MySQL

Greatest N per group: dealing with aggregates

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Vlad Enache asks:

In MySQL I have a table called meanings with three columns:

meanings
person word meaning
1 1 4
1 2 19
1 2 7
1 3 5

word has 16 possible values, meaning has 26.

A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2.

There will be thousands of persons.

I need to find the top three meanings for each of the 16 words, with their frequencies. Something like:

  • word 1: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)
  • word 2: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)

etc.

Is it possible to solve this with a single MySQL query?

This task is a typical greatest-n-per-group problem.

Earlier I described some solutions to it, one using session variables, and another one using LIMIT in a subquery. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve three greatest aggregates, not three greatest records. It is not recommended to mix the session variables with the JOINs, and using the LIMIT solution would be inefficient, since the aggregates can not be indexed.

Some databases, namely, PostgreSQL, used to exploit the array functionality for this task (before the window functions were introduced in 8.4).

Unfortunately, MySQL does not support arrays, but we can emulate this behavior using string functions and GROUP_CONCAT.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

March 18th, 2010 at 11:00 pm

Posted in MySQL

Aggregates and LEFT JOIN

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table product with products and table sale with all sale operations that were done on these products.

I would like to get 10 most often sold products today and what I did is this:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
sale s
ON      s.product_id = p.id
AND s.dt >= '2010-01-01'
AND s.dt < '2010-01-02'
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10

, but performance of it is very slow.

What can I do to increase performance of this particular query?

The query involves a LEFT JOIN which in MySQL world means that products will be made leading in the query. Each record of product will be taken and checked against sale table to find out the number of matching records. If no matching records are found, 0 is returned.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

March 5th, 2010 at 11:00 pm

Posted in MySQL

Matching sets: aggregates vs. first miss

Comments enabled. I *really* need your comment

From Stack Overflow:

Here is my schema:

Suppliers
sid sname address
Parts
pid pname color
Catalog
sid pid cost

I need to find the sids of suppliers who supply every red part or every green part.

This task requires matching the sets.

We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.

Unlike other engines, MySQL does not provide the set operators like EXCEPT or MINUS that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.

There are two ways to check that:

  • First miss technique: test each record from the subset candidate against the superset candidate, returning FALSE if there is no match.
  • Aggregate technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match

Let's test which way is faster in which cases. To do this, we will need some sample tables:
Read the rest of this entry »

Written by Quassnoi

February 25th, 2010 at 11:00 pm

Posted in MySQL

Tags in nested sets: efficient indexing

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Travis asks:

I just read your excellent post: Hierarchical data in MySQL: parents and children in one query.

I am currently trying to do something very similar to this. The main difference is, I am working with data that is in a nested set.

I would like to construct a query that will return a resultset of nodes matched with LIKE, the ancestors of each of these nodes, and the immediate children of each of these nodes.

This is a very interesting question: it allows to demonstrate the usage of three types of indexes usable by MyISAM tables: BTREE, SPATIAL and FULLTEXT.

Nested sets is one of two models that are most often used to store hierarchical data in a relational table (the other one being adjacency list model). Unlike adjacency list, nested sets does not require ability to write recursive queries, which SQL originally lacked and still lacks now in MySQL (albeit it can be emulated to some extent). It is widely used in MySQL world.

I described both methods and their comparison in the article I wrote some time ago:

The main problem with the nested sets model is that though it is extremely fast with selecting descendants of a given node, it is very slow in selecting ancestors.

This is because of the way the B-Tree indexes work. They can be used to query for values of a column within the range of two constants, but not for the values of two columns holding a single constant between them. One needs the first condition to select the children (found between the lft and rgt of the given node), and the second condition to select the ancestor (with lft and rgt containing the lft and rgt of the given node).

That's why selecting the children is fast and selecting the ancestors is slow.

To work around this, the sets that form the hierarchy can be described as geometrical objects, with the larger sets containing the smaller sets. These sets can be indexed with a SPATIAL index which is designed specially for this purpose and both children and ancestors can be queried for very efficiently.

Unfortunately, finding the depth level is quite a hard task for the nested sets model even with the SPATIAL indexes.

It would be quite an easy task is MySQL supported recursion: we could just run a query to find the siblings of each record by skipping their whole domains recursively.

However, MySQL's recursion support is very limited and it relies on the session variables, which are not recommended to use in the complex queries.

To cope with this, we need to mix the nested sets and the adjacency list models. Hierarchy will be stored in two seemingly redundant ways: the unique parent and the LineString representing the nested sets.

This will help us to use the R-Tree index to find all ancestors of a given node and also use B-Tree index to find its immediate children.

Finally, the question mentions using LIKE to find the initial nodes. LIKE predicate with the leading wildcards is not sargable in MySQL. However, it seems that the leading wildcards are only used to split the words. In this case, a FULLTEXT index and the MATCH query would be much more efficient, since FULLTEXT index allows indexing a single record with several keys (each one corresponding to a single word in the column's text), so a search for the word in the space separated or a comma separated list uses the index and is much faster than scanning the whole table.

Hence, the query would use all three main types of indexes: BTREE, SPATIAL and FULLTEXT.

To illustrate everything said above, let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

February 16th, 2010 at 11:00 pm

Posted in MySQL