Archive for the ‘MySQL’ Category
Matching sets: aggregates vs. first miss
From Stack Overflow:
Here is my schema:
Suppliers sid sname address
Parts pid pname color
Catalog sid pid cost I need to find the sids of suppliers who supply every red part or every green part.
This task requires matching the sets.
We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.
Unlike other engines, MySQL does not provide the set operators like EXCEPT or MINUS that allow to check the subset / superset relationship very easily. We have to use the record-based solutions.
There are two ways to check that:
First miss
technique: test each record from the subset candidate against the superset candidate, returningFALSEif there is no match.Aggregate
technique: compare the number of records in the subset candidate to the number of records in their intersection. If the numbers are equal, the sets match
Let’s test which way is faster in which cases. To do this, we will need some sample tables:
Read the rest of this entry »
Tags in nested sets: efficient indexing
Answering questions asked on the site.
Travis asks:
I just read your excellent post: Hierarchical data in MySQL: parents and children in one query.
I am currently trying to do something very similar to this. The main difference is, I am working with data that is in a nested set.
I would like to construct a query that will return a resultset of nodes matched with
LIKE, the ancestors of each of these nodes, and the immediate children of each of these nodes.
This is a very interesting question: it allows to demonstrate the usage of three types of indexes usable by MyISAM tables: BTREE, SPATIAL and FULLTEXT.
Nested sets is one of two models that are most often used to store hierarchical data in a relational table (the other one being adjacency list model). Unlike adjacency list, nested sets does not require ability to write recursive queries, which SQL originally lacked and still lacks now in MySQL (albeit it can be emulated to some extent). It is widely used in MySQL world.
I described both methods and their comparison in the article I wrote some time ago:
The main problem with the nested sets model is that though it is extremely fast with selecting descendants of a given node, it is very slow in selecting ancestors.
This is because of the way the B-Tree indexes work. They can be used to query for values of a column within the range of two constants, but not for the values of two columns holding a single constant between them. One needs the first condition to select the children (found between the lft and rgt of the given node), and the second condition to select the ancestor (with lft and rgt containing the lft and rgt of the given node).
That’s why selecting the children is fast and selecting the ancestors is slow.
To work around this, the sets that form the hierarchy can be described as geometrical objects, with the larger sets containing the smaller sets. These sets can be indexed with a SPATIAL index which is designed specially for this purpose and both children and ancestors can be queried for very efficiently.
Unfortunately, finding the depth level is quite a hard task for the nested sets model even with the SPATIAL indexes.
It would be quite an easy task is MySQL supported recursion: we could just run a query to find the siblings of each record by skipping their whole domains recursively.
However, MySQL‘s recursion support is very limited and it relies on the session variables, which are not recommended to use in the complex queries.
To cope with this, we need to mix the nested sets and the adjacency list models. Hierarchy will be stored in two seemingly redundant ways: the unique parent and the LineString representing the nested sets.
This will help us to use the R-Tree index to find all ancestors of a given node and also use B-Tree index to find its immediate children.
Finally, the question mentions using LIKE to find the initial nodes. LIKE predicate with the leading wildcards is not sargable in MySQL. However, it seems that the leading wildcards are only used to split the words. In this case, a FULLTEXT index and the MATCH query would be much more efficient, since FULLTEXT index allows indexing a single record with several keys (each one corresponding to a single word in the column’s text), so a search for the word in the space separated or a comma separated list uses the index and is much faster than scanning the whole table.
Hence, the query would use all three main types of indexes: BTREE, SPATIAL and FULLTEXT.
To illustrate everything said above, let’s create a sample table:
Read the rest of this entry »
Index search time depends on the value being searched
Answering questions asked on the site.
Daniel asks:
I have a table which stores track titles in a
VARCHAR(200)field. The field is indexed, but searching for titles beginning with a letter Z is noticeably slower than for those beginning with A, and the closer the initial letter is to Z, the slower is the query.My understanding is that a full table scan occurs, but
EXPLAINshows that the index is used. Besides, the table is quite large but the query is still reasonably fast.Could you please explain this behavior?
MySQL stores its indexes in B-Tree data structures.
The Wikipedia link above explains the structure quite well so I won’t repeat it. I’ll rather draw a picture similar to the one in the article:

This picture is quite self-explanatory. The records are sorted in a tree order, so if you are searching for a certain value, say, 11, you, starting from the first page, should find the link to follow. To do this, you need to find the pair of values less than and greater than 11. In this case, you should follow the link which is between 8 and 12. Then you search for the next pair, etc, until you find your value or reach the end and make sure that your value is not there.
Following the links is quite simple, but how does the engine search for the values within one page?
This depends on how you declared the table.
MyISAM supports two algorithms for storing the index keys in a page: packed keys and unpacked keys.
Read the rest of this entry »
Join on overlapping date ranges
This post is inspired by a discussion with John Didion:
Is there any way to optimize the query for overlapping ranges in MySQL if both ranges are dynamic?
I have two tables, each with integer range columns (specified as
LineString), and I want to find rows that overlap.No matter what I try, the query planner never uses any indexes.
This question addresses a well-known problem of efficient searching for the intersecting intervals. The queries that deal with it require ability to search for the intervals (stored in two distinct columns) containing a constant scalar value.
Plain B-Tree indexes used by most databases do not speed up the queries like that. However, MySQL supports SPATIAL indexes that can index two-dimensional shapes and efficiently search for the shapes containing a given point.
With a little effort, time intervals can be converted into the geometrical objects, indexed with a SPATIAL index and searched for the given point in time (also presented as a gemetrical object). This is described in the article about overlapping ranges in MySQL.
The query described in that article, however, searches for the intervals overlapping a constant range, provided as a parameter to the query. Now, I will discuss how to adapt the query for a JOIN between two tables.
Let’s create two sample tables, each containing a set of time ranges stored as geometrical objects, and find all records from both tables whose ranges overlap:
Read the rest of this entry »
Aggregates: subqueries vs. GROUP BY
From Stack Overflow:
I have a table
usersand there is a fieldinvited_by_idshowing user id of the person who invited this user.I need to make a MySQL query returning rows with all the fields from users plus a
invites_countfield showing how many people were invited by each user.
The task seems very simple (and it is in fact), but there are at least two approaches to do it. It this article, I will discuss the benefits and the drawbacks of each approach.
GROUP BY
The first approach is using GROUP BY:
SELECT u.*, COUNT(ui.id)
FROM users u
LEFT JOIN
users ui
ON ui.invited_by = u.id
GROUP BY
u.id
This is a GROUP BY with a self join, very simple. There are only two little things I’d like to pay some attention to.
First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and use COUNT(ui.id) instead of COUNT(*), because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, and COUNT(ui.id) skips NULLs (which can only result from a LEFT JOIN miss).
Second, we group by u.id but use u.* in the SELECT clause. Every other engine would fail in this case, but MySQL allows selecting fields that are neighter grouped by nor aggregated. These fields will return an arbitrary value from any of the aggregated records (in practice, that is the record first read in its group).
This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it’s perfectly valid for the queries like ours. We don’t need to clutter the GROUP BY clause with all fields from users if we have already grouped by the PRIMARY KEY which is already unique. All other values from users are uniquely defined by the PRIMARY KEY so there is no matter which arbitrary record will the query use to return ungrouped values: they are all same within the group.
Subquery
This solution involves correlated subqueries:
SELECT u.*,
(
SELECT COUNT(*)
FROM users ui
WHERE ui.invited_by = u.id
)
FROM users u
Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly same records as the previous one.
Comparison
Let’s create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:
Read the rest of this entry »
Latest visitors
From Stack Overflow:
Say you want to display the latest visitors on a users profile page.
How would you structure this?
Perhaps a table called
uservisitors:
uservisitors userid visitorid time And how would you select this with MySQL without any duplicates?
What I mean is if user 1 visits user 2‘s profile, then 5 minutes later visits it again, I don’t want to show both entries: only the latest.
There are two approaches to this.
First one would be just aggregating the visits, finding the max time and ordering on it. Something like this:
SELECT visitorid, MAX(time) AS lastvisit
FROM uservisitors
WHERE userid = 1
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5
However, there is a little problem with this solution.
Despite the fact that MySQL (with proper indexing) uses INDEX FOR GROUP-BY optimization for this query, it will still have to sort on MAX(time) to find 5 latest records.
This will require sorting the whole resultset which will be huge if the service is heavily loaded.
Let’s test it on a sample table:
Read the rest of this entry »
Linked lists in MySQL: multiple ordering
Answering questions asked on the site.
Rick McIntosh asks:
I have two tables, one with a list of spreadsheets (
a) and the other with a list of the column headings that show in the spreadsheets (b).
a id parent 1 2 2 0
b id parent aid 1 1 1 2 0 1 3 2 1 4 6 2 5 4 2 6 0 2 I want to bring the columns back in first the order of the spreadsheets as defined by the
a.parent_idthen ordered asb.parent_id:
id parent aid aparent 6 0 2 0 4 6 2 0 5 4 2 0 2 0 1 2 3 2 1 2 1 1 1 2
This can be done using the same recursion technique as the one that was used to build a simple linked list.
MySQL does not support recursion directly, but it can be emulated using subquery calls in the SELECT clause of the query, using session variables to store the recursion state.
In this case we need to do the following:
-
Generate a dummy recordset for recursion that would contain as many rows as the resulting recordset. This is best done by issuing a
JOINon theFOREIGN KEY, without any ordering. The values of the recordset will not be used in the actual query. -
Initialize
@aand@bto be the first value ofaand a 0, accordingly. -
In the loop, make a query that would return the next item of
bfor the current value of@a, or, should it fail, return the first item ofbfor the next@a. This is best done using aUNION ALLwith aLIMIT. -
Adjust
@aso that is points to the correct item: just select the appropriate value fromb
Let’s create a sample table:
Read the rest of this entry »
MySQL: Selecting records holding group-wise maximum (resolving ties)
Continuing the series on selecting records holding group-wise maximums.
The approach shown in the previous article is quite efficient, however, it can only be relied upon if the column being maximized is unique.
If the column is not unique (even along with the grouping column), the ties are possible: multiple records can satisfy the condition of holding a group-wise maximum
.
In this article, I’ll show how to resolve the ties.
Let’s recreate the table we used in the previous article:
Read the rest of this entry »
MySQL: Selecting records holding group-wise maximum (on a unique column)
Answering questions asked on the site.
Juan asks:
Regarding this question:
I would like to know what database is the most efficient when performing this type of query and if the solutions proposed are already the most efficient.
I am using MySQL but I would like to know if PostgreSQL, Oracle, Microsoft SQL Server or DB2 are much more efficient.
A nice question, this is a common problem for most developers.
For those too lazy to follow the link, the problem in one sentence:
How do I select the whole records, grouped on
grouperand holding a group-wise maximum (or minimum) on other column?
I’ve already covered this in passing in various previous articles in my blog, and now I’ll try to make a little summary.
According to the SQL standards, every expression used in the SELECT list of a query that deals with GROUP BY should be either grouped by (that is, used in the GROUP BY) or aggregated.
The reason behind this is quite logical: a grouping query takes the records that have something in common and shrinks them, making a single record out of the whole set constituting the group.
The grouping expressions, of course, share their values across all the records constituting the group (since they are what that makes the group a group), and hence can be used as is.
Other expressions can vary and therefore some algorithm should be applied for shrinking all existing values into a single value. This is what aggregating function does.
The problem is that the the aggregating functions are not interdependent: you cannot use the result of one function as an input to another one. Hence, you cannot select a record in an aggregate query: you can only select a tuple of aggregated values. It’s possible that each of these aggregated values in fact can never even found among the values of the original records. Say, with all values of column being strictly positive, you can never find a record holding a SUM(column): it will be greater than any of the values of column.
However, with MIN or MAX, it is guaranteed that at least one record holds the non-NULL value returned by these aggregate functions. The problem is to return the whole record without aggregating the other columns.
As SQL evolved, different solutions had been proposed to solve this problem, and I’ll try to cover these solutions in this series of articles.
First, I’ll cover MySQL (as per request)
MySQL‘s dialect of SQL is much more poor than those used by the other systems. That’s why most of the solutions used by the other systems, like analytic functions, DISTINCT ON etc. cannot be used in MySQL.
However, MySQL still provides some ways to make the queries that solve this task more efficient.
Let’s create a sample table:
Read the rest of this entry »
Selecting most relevant row
From Stack Overflow:
I have whole bunch of contact data that may or may not contain all info, like this:
Name Address1 Address2 Test User Address1 Address2 test@example.com NULL NULL NULL test@example.com For each email address, I would love to extract the most relevant record, that is the row with the contact information rather than the row with empty fields
To do this we need a sorting criterion: how to tell that a record is considered more relevant
than another record. We can use the number of the fields filled as a criterion.
Let’s create a sample table:
This table has 20,000 records with 20 distinct emails.
To find out the most filled values, we need to do the following:
- Select the list of the emails
- For each email, sort the records so that the most filled record goes first. To do this, we just cast the boolean values (like
name IS NULL) to the integers and add them up - Select the first record’s id and join on this id
This guarantees that for each distinct email in the table, exactly one record will be selected (despite the actual number of records filled).
And here is the query:
SELECT ai.*
FROM (
SELECT DISTINCT email
FROM address
) ad
JOIN address ai
ON ai.id =
(
SELECT id
FROM address ao
WHERE ao.email = ad.email
ORDER BY
(name IS NULL) + (address1 IS NULL) + (address2 IS NULL)
LIMIT 1
)
| id | name | address1 | address2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18390 | Name 18390 | Address 1 18390 | Address 2 18390 | test10@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3591 | Name 3591 | Address 1 3591 | test11@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2052 | Name 2052 | Address 2 2052 | test12@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19473 | Name 19473 | Address 1 19473 | Address 2 19473 | test13@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8234 | Name 8234 | Address 1 8234 | Address 2 8234 | test14@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14355 | Name 14355 | Address 1 14355 | Address 2 14355 | test15@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6916 | Name 6916 | Address 1 6916 | Address 2 6916 | test16@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8937 | Name 8937 | Address 1 8937 | Address 2 8937 | test17@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14378 | Name 14378 | Address 1 14378 | Address 2 14378 | test18@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3099 | Address 1 3099 | Address 2 3099 | test19@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2821 | Name 2821 | Address 1 2821 | Address 2 2821 | test1@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17580 | Name 17580 | Address 1 17580 | Address 2 17580 | test20@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16902 | Name 16902 | Address 1 16902 | test2@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17703 | Name 17703 | Address 1 17703 | Address 2 17703 | test3@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4324 | Name 4324 | Address 1 4324 | Address 2 4324 | test4@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2545 | Name 2545 | Address 1 2545 | Address 2 2545 | test5@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1546 | Name 1546 | Address 1 1546 | Address 2 1546 | test6@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18707 | Name 18707 | Address 1 18707 | Address 2 18707 | test7@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12048 | Address 1 12048 | Address 2 12048 | test8@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1009 | Name 1009 | Address 1 1009 | Address 2 1009 | test9@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 rows fetched in 0.0010s (0.2543s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | <derived2> | ALL | 20 | 100.00 | |||||
| 1 | PRIMARY | ai | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where |
| 3 | DEPENDENT SUBQUERY | ao | ref | ix_address_email | ix_address_email | 767 | ad.email | 646 | 100.00 | Using where; Using filesort |
| 2 | DERIVED | address | range | ix_address_email | 767 | 31 | 100.00 | Using index for group-by |
Field or reference 'ad.email' of SELECT #3 was resolved in SELECT #1 select `20091117_relevant`.`ai`.`id` AS `id`,`20091117_relevant`.`ai`.`name` AS `name`,`20091117_relevant`.`ai`.`address1` AS `address1`,`20091117_relevant`.`ai`.`address2` AS `address2`,`20091117_relevant`.`ai`.`email` AS `email` from (select distinct `20091117_relevant`.`address`.`email` AS `email` from `20091117_relevant`.`address`) `ad` join `20091117_relevant`.`address` `ai` where (`20091117_relevant`.`ai`.`id` = (select `20091117_relevant`.`ao`.`id` AS `id` from `20091117_relevant`.`address` `ao` where (`20091117_relevant`.`ao`.`email` = `ad`.`email`) order by ((isnull(`20091117_relevant`.`ao`.`name`) + isnull(`20091117_relevant`.`ao`.`address1`)) + isnull(`20091117_relevant`.`ao`.`address2`)) limit 1))
As you can see, not all records are filled completely.
Nevertheless, the query selects exactly one record per email, and this is the most filled record for this email.
Subscribe in a reader