Mixed ASC/DESC sorting in MySQL
From Stack Overflow:
I have a table of blog posts, each with a foreign key back to it’s author. There are less than 15,000 entries in this table.
This query scans over 19,000 rows (per
EXPLAIN), requires afilesort(that might be regular MySQL behavior), and takes over 400ms to return 5 rows. possibly because of the complicatedWHEREused to check if the item is actually published.SELECT blog_post.id, blog_post.title, blog_post.author_id, blog_post.has_been_fact_checked, blog_post.published_date, blog_post.ordering, auth_user.username, auth_user.email FROM blog_post INNER JOIN auth_user ON auth_user.id = blog_post.author_id WHERE blog_post.is_approved = 1 AND blog_post.has_been_fact_checked = 1 AND blog_post.published_date < = '2010-10-25 22:40:05' ORDER BY blog_post.published_date DESC, blog_post.ordering ASC, blog_post.id DESC LIMIT 5
How can I wrangle this query under control?
This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the ORDER BY perfectly, so the whole query could be executed using an index scan without any filesorts.
The only problem is that we have the mixed directions in ORDER BY, and MySQL does not support ASC / DESC clauses for the indexes.
With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the ordering column, say, by creating its copy and storing negative values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That’s what many engines do, MediaWiki (which Wikipedia runs on) being one of the most well-known examples.
This solution is nice
, I hear people saying, but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think
.
OK, this is a good point. Let’s see what we can do with the current design, and, as always, create a sample table to do it:
Read the rest of this entry »
Correlated LEFT JOIN in a subquery
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Beren asks:
In my database, I store the information about the game matches.
A match can be played between two players (mostly) or, occasionally, between two teams of 2, 3 or 4 players.
I store the matches in a table like this:
Game id player1 player2 type , where
typeisENUM(player, team).If
typeisplayer, the ids of players are stored in the record; if type isteam, those of teams are stored.Now, the tricky part. For a given game, I need to select two lists of players from both sides, comma separated, be they single players or members of a team.
Are two separate queries required for this?
This of course can be easily done using a single query.
Let’s make a sample table and see:
Read the rest of this entry »
OLAP in MySQL: four ways to filter on higher level dimensions
Answering questions asked on the site.
Denis Kuzmenok asks:
I need some help with a query I’m fighting with.
I have these tables:
Product id parent name
Site id region name
Price id product site value For the products with a certain
product.parent, I need to select minimal and maximal price within the givensite.region, and return first 10 products ordered by the minimal price.Each
parenthas 100 to 10,000 products, and there are 10,000,000 records inpricetable.
We have a classical OLAP task here: a fact table (price) and two dimension tables (product and site).
The task would be almost trivial if we were given the exact values of product and site. This way, we could build a composite index on these fields and value, filter on the exact values and get the first 10 values from the index.
However, we are given not the values of the dimensions themselves but those of the higher levels (parent for product and region for site).
Since the values of the levels are not stored in the facts table, we cannot index them. We need to join the dimension tables and filter on them.
This article describes four ways to join the tables, their efficiency varying depending on the density of the dimension values.
Since the only algorithm to make the joins MySQL is capable of is nested loops, basically, we need to define the order in which the tables would be joined.
There are three tables and, hence, 3! = 6 permutations that define all possible join orders:
| product | site | price |
| product | price | site |
| site | price | product |
| site | product | price |
| price | product | site |
| price | site | product |
However, two dimension tables are completely independent on each other. This means that if they come one after another in the join their order does not actually matter: they both will be searched for independent values. This reduces the number of actual combinations:
| product/site | price | |
| product | price | site |
| site | price | product |
| price | product/site | |
The joins must be designed so that the tables with most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria. The more values from the table satisfy the search criteria, the later should the table come in the join, so that by the time the join occurs, most values would have been already sifted out.
Now, let’s build the queries for all types of the join. To do this, we will create sample tables:
Read the rest of this entry »
20 latest unique records
From Stack Overflow:
I have a logfile which logs the insert/delete/updates from all kinds of tables.
I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (
datetime DESC)
A common solution for such a task would be writing an aggregate query with ORDER BY and LIMIT:
SELECT person, MAX(ts) AS last_update
FROM logfile
GROUP BY
person
ORDER BY
last_update DESC
LIMIT 20
What’s bad in this solution? Performance, as usual.
Since last_update is an aggregate, it cannot be indexed. And ORDER BY on unindexed fields results in our good old friend, filesort.
Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on (person, ts), MySQL will tend to use a loose index scan on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), loose index scan can even degrade performance and generally will be avoided by MySQL.
We should use another approach here. Let’s create a sample table and test this approach:
Read the rest of this entry »
Questions
I had tons of work the last month and got a whole bunch of questions in the queue which I haven’t answered yet.

Pawel, Mark, Kate, Felix, Andrew, Ludwig, Santos, Jeremy, John, another John, Anjan and Dave — I do remember about you, guys!
Will try to resume the normal blogging schedule the next week.
Stay subscribed!
Indexing for ORDER BY / LIMIT
Answering questions asked on the site.
Frode Underhill asks:
I have some applications that are logging to a MySQL database table.
The table is pretty standard on the form:
time BIGINT(20)source TINYINT(4)severity ENUMtext VARCHAR(255), where
sourceidentifies the system that generated the log entry.There are very many entries in the table (>100 million), of which 99.9999% are debug or info messages.
I’m making an interface for browsing this log, which means I’ll be doing queries like
SELECT * FROM log WHERE source = 2 AND severity IN (1,2) AND time > 12345 ORDER BY time ASC LIMIT 30, if I want to find debug or info log entries from a certain point in time, or
SELECT * FROM log WHERE source = 2 AND severity IN (1,2) AND time < 12345 ORDER BY time DESC LIMIT 30for finding entries right before a certain time.
How would one go about indexing & querying such a table?
I thought I had it figured out (I pretty much just tried every different combination of columns in an index), but there’s always some set of parameters that results in a really slow query.
The problem is that you cannot use a single index both for filtering and ordering if you have a ranged condition (severity IN (1, 2) in this case).
Recently I wrote an article with a proposal to improve SQL optimizer to handle these conditions. If a range has low cardinality (this is, there are few values that con possibly satisfy the range), then the query could be improved by rewriting the range as a series of individual queries, each one using one of the values constituting the range in an equijoin:
No optimizers can handle this condition automatically yet, so we’ll need to emulate it.
Since the severity field is defined as an enum with only 5 values possible, any range condition on this field can be satisfied by no more than 5 distinct values, thus making this table ideal for rewriting the query.
Let’s create a sample table:
Read the rest of this entry »
GROUP_CONCAT in SQL Server
I’m finally back from my vacation. Tunisia’s great: dates, Carthage, sea and stuff.
Now, to the questions.
Mahen asks:
Create a table called
Group:
Group id prodname 1 X 1 Y 1 Z 2 A 2 B 2 C The resultset should look like this:
id prodname 1 X,Y,Z 2 A,B,C Can you please help me to solve the above problem using a recursive CTE?
This is out good old friend, GROUP_CONCAT. It’s an aggregate function that returns all strings within a group, concatenated. It’s somewhat different from the other aggregate functions, because, first, dealing with the concatenated string can be quite a tedious task for the groups with lots of records (large strings tend to overflow), and, second, the result depends on the order of the arguments (which is normally not the case for the aggregate functions). It’s not a part of a standard SQL and as for now is implemented only by MySQL with some extra vendor-specific keywords (like ORDER BY within the argument list).
This functionality, however, is often asked for and I have written some articles about implementing this in PostgreSQL and Oracle.
Now, let’s see how to do it in SQL Server.
Usually, SQL Server‘s FOR XML clause is exploited to concatenate the strings. To do this, we obtain a list of group identifiers and for each group, retrieve all it’s product names with a subquery appended with FOR XML PATH(''). This makes a single XML column out of the recordset:
Read the rest of this entry »
LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns
In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.
Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.
In SQL, there are at least three methods to implement it:
LEFT JOIN / IS NULL
SELECT o.*
FROM outer o
LEFT JOIN
inner i
ON i.value = o.value
WHERE i.value IS NULL
NOT IN
SELECT o.*
FROM outer o
WHERE o.value NOT IN
(
SELECT value
FROM inner
)
NOT EXISTS
SELECT o.*
FROM outer o
WHERE NOT EXISTS
(
SELECT NULL
FROM inner i
WHERE i.value = o.value
)
When inner.value is marked as NOT NULL, all these queries are semantically equivalent and with proper indexing have similarly optimized execution plans in MySQL.
Now, what if inner.value is not nullable and does contain some NULL values?
Let’s create some sample tables:
Read the rest of this entry »
Things SQL needs: determining range cardinality
What is the problem with this query?
SELECT *
FROM orders
WHERE quantity < = 4
AND urgency <= 4
The problem is indexing strategy, of course. Which columns should we index?
If we index quantity, the optimizer will be able to use the index to filter on it. However, filtering on urgency will require scanning all records with quantity < 4 and applying the urgency filter to each record found.
Same with urgency. We can use range access on urgency using an index, but this will require filtering on quantity.
Why, create a composite index!
, some will say.
Unfortunately, that won’t help much.
A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on (quantity, urgency) will sort on quantity, and only if the quantities are equal, it will take the urgency into account.
The picture below shows how would the records be ordered in such an index:

As we can see, with a single index range scan (i. e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.
Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:
Read the rest of this entry »
MAX and MIN on a composite index
Answering questions asked on the site.
Ivo Radev asks:
I am trying to make a very simple query.
We have a log table which different machines write to. Given the machine list, I need to find the latest log timestamp.
Currently, the query looks like this:
SELECT MAX(log_time) FROM log_table WHERE log_machine IN ($machines), and I pass the comma-separated list of
$machinesfrom PHP.The weird thing is that the query is literally instant when there is only one machine (any) in the list but slow when there are multiple machines.
I’m considering doing it in separate queries and then process the results in PHP. However I’d like to know if there is a fast solution in MySQL.
Most probably, there is a composite index on (log_machine, log_time) which is being used for the query.
Usually, a query like this:
SELECT MAX(log_time) FROM log_table
on the indexed field log_time can be served with a single index seek on the index.
Indeed, the MAX(log_time), by definition, is the latest entry in the index order, and can be fetched merely by finding the trailing index entry. It’s a matter of several page reads in the B-Tree, each one following the rightmost link to the lower-level page.
Similarly, this query:
SELECT MAX(log_time) FROM log_table WHERE log_machine = $my_machine
can be served with a single index seek too. However, the index should include log_machine as a leading column.
In this case, a set of records satisfying the WHERE clause of the query is represented by a single logically continuous block of records in the index, each one sharing the same value of log_machine. MAX(log_time) will of course be held by the last record in this block. MySQL just finds that last record and takes the log_time out of it.
Now, what if we have a multiple condition on log_machine?
Read the rest of this entry »
Subscribe in a reader