EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: 5 Claims About SQL, Explained. You're welcome to read and comment on it.

Archive for the ‘MySQL’ Category

Empty row if condition does not match

with 4 comments

Just found that in a Google referer to the blog:

I want SQL to return blank row even if the condition does not match

This may be useful for certain ORMs which always expect a single row as a result of a query.

Say, we have a query like that:

SELECT  *
FROM    mytable
WHERE   id = 42

and want it to return a single row (possibly consisting of NULL values) no matter what.

If we had a join and the condition in the ON clause:

SELECT  m.*
FROM    values v
JOIN    mytable m
ON      m.id = v.value

, we could just rewrite an INNER JOIN to a LEFT JOIN.

SELECT  m.*
FROM    values v
LEFT JOIN
        mytable m
ON      m.id = v.value

This way, we would have at least one record returned for each entry in values.

In our original query we don't have a table to join with. But we can easily generate it:

SELECT  m.*
FROM    (
        SELECT  42 AS value
        ) v
LEFT JOIN
        mytable m
ON      m.id = v.value

If id is a PRIMARY KEY on mytable, this query would return exactly one record, regardless of whether it such an id exists in mytable or not.

Written by Quassnoi

January 3rd, 2013 at 11:00 pm

Posted in MySQL

MySQL: GROUP BY in UNION

with 2 comments

From Stack Overflow:

I have a query where I have a custom developed UDF that is used to calculate whether or not certain points are within a polygon (first query in UNION) or circular (second query in UNION) shape.

SELECT  a.geo_boundary_id, …
FROM     geo_boundary_vertex a, …
…
GROUP BY
        a.geo_boundary_id
UNION
SELECT  b.geo_boundary_id, …
FROM     geo_boundary b, …
…
GROUP BY
        b.geo_boundary_id

When I run an explain for the query I get filesort for both queries within the UNION.

Now, I can split the queries up and use the ORDER BY NULL trick to get rid of the filesort however when I attempt to add that to the end of a UNION it doesn't work.

How do I get rid of the filesort?

In MySQL, GROUP BY also implies ORDER BY on the same set of expressions in the same order. That's why it adds an additional filesort operation to sort the resultset if it does not come out naturally sorted (say, from an index).

This is not always a desired behavior, and MySQL manual suggests adding ORDER BY NULL to the queries where sorting is not required. This can improve performance of the queries.

Let's create a sample table and see:

Read the rest of this entry »

Written by Quassnoi

March 30th, 2011 at 11:00 pm

Posted in MySQL

MySQL: splitting aggregate queries

with 2 comments

Answering questions asked on the site.

Victor asks:

I have a table which I will call sale to protect the innocent:

Sale
id product price amount date

I need to retrieve ultimate values of price, amount and date for each product:

SELECT  product,
        MIN(price), MAX(price),
        MIN(amount), MAX(amount),
        MIN(date), MAX(date)
FROM    sale
GROUP BY
        product

The query only returns about 100 records.

I have all these fields indexed (together with product), but this still produces a full table scan over 3,000,000 records.

How do I speed up this query?

To retrieve the ultimate values of the fields, MySQL would just need to make a loose index scan over each index and take the max and min value of the field for each product.

However, the optimizer won't do when multiple indexes are involved. Instead, it will revert to a full scan.

There is a workaround for it. Let's create a sample table and see them:

Read the rest of this entry »

Written by Quassnoi

March 28th, 2011 at 11:00 pm

Posted in MySQL

Late row lookups: InnoDB

with 7 comments

Answering questions asked on the site.

Aryé asks:

Thanks for your article about late row lookups in MySQL.

I have two questions for you please:

  • Is this workaround specific to MyISAM engine?
  • How does PostgreSQL handle this?

The questions concerns a certain workaround for MySQL LIMIT … OFFSET queries like this:

SELECT  *
FROM    mytable
ORDER BY
        id
LIMIT   10
OFFSET  10000

which can be improved using a little rewrite:

SELECT  m.*
FROM    (
        SELECT  id
        FROM    mytable
        ORDER BY
                id
        LIMIT   10
        OFFSET  10000
        ) q
JOIN    mytable m
ON      m.id = q.id
ORDER BY
        m.id

For the rationale behind this improvement, please read the original article.

Now, to the questions.

The second questions is easy: PostgreSQL won't pull the fields from the table until it really needs them. If a query involving an ORDER BY along with LIMIT and OFFSET is optimized to use the index for the ORDER BY part, the table lookups won't happen for the records skipped.

Though PostgreSQL does not reflect the table lookups in the EXPLAIN output, a simple test would show us that they are done only LIMIT times, not OFFSET + LIMIT (like MySQL does).

Now, let's try to answer the first question: will this trick improve the queries against an InnoDB table?

To do this, we will create a sample table:

Read the rest of this entry »

Written by Quassnoi

February 11th, 2011 at 11:00 pm

Posted in MySQL

10 things in MySQL (that won’t work as expected)

with 33 comments

(I just discovered cracked.com)

#10. Searching for a NULL

SELECT  *
FROM    a
WHERE   a.column = NULL

In SQL, a NULL is never equal to anything, even another NULL. This query won't return anything and in fact will be thrown out by the optimizer when building the plan.

When searching for NULL values, use this instead:

SELECT  *
FROM    a
WHERE   a.column IS NULL

Read the rest of this entry »

Written by Quassnoi

November 3rd, 2010 at 11:00 pm

Posted in MySQL

Mixed ASC/DESC sorting in MySQL

with 10 comments

From Stack Overflow:

I have a table of blog posts, each with a foreign key back to it's author. There are less than 15,000 entries in this table.

This query scans over 19,000 rows (per EXPLAIN), requires a filesort (that might be regular MySQL behavior), and takes over 400ms to return 5 rows. possibly because of the complicated WHERE used to check if the item is actually published.

SELECT  blog_post.id,
        blog_post.title,
        blog_post.author_id,
        blog_post.has_been_fact_checked,
        blog_post.published_date,
        blog_post.ordering,
        auth_user.username,
        auth_user.email
FROM    blog_post
INNER JOIN
        auth_user
ON      auth_user.id = blog_post.author_id
WHERE   blog_post.is_approved = 1
        AND blog_post.has_been_fact_checked = 1
        AND blog_post.published_date < =  '2010-10-25 22:40:05'
ORDER BY
        blog_post.published_date DESC,
        blog_post.ordering ASC,
        blog_post.id DESC
LIMIT 5


How can I wrangle this query under control?

This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the ORDER BY perfectly, so the whole query could be executed using an index scan without any filesorts.

The only problem is that we have the mixed directions in ORDER BY, and MySQL does not support ASC / DESC clauses for the indexes.

With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the ordering column, say, by creating its copy and storing negative values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That's what many engines do, MediaWiki (which Wikipedia runs on) being one of the most well-known examples.

This solution is nice, I hear people saying, but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think.

OK, this is a good point. Let's see what we can do with the current design, and, as always, create a sample table to do it:
Read the rest of this entry »

Written by Quassnoi

November 2nd, 2010 at 11:00 pm

Posted in MySQL

Correlated LEFT JOIN in a subquery

with one comment

Answering questions asked on the site.

Beren asks:

In my database, I store the information about the game matches.

A match can be played between two players (mostly) or, occasionally, between two teams of 2, 3 or 4 players.

I store the matches in a table like this:

Game
id player1 player2 type

, where type is ENUM(player, team).

If type is player, the ids of players are stored in the record; if type is team, those of teams are stored.

Now, the tricky part. For a given game, I need to select two lists of players from both sides, comma separated, be they single players or members of a team.

Are two separate queries required for this?

This of course can be easily done using a single query.

Let's make a sample table and see:
Read the rest of this entry »

Written by Quassnoi

October 20th, 2010 at 11:00 pm

Posted in MySQL

OLAP in MySQL: four ways to filter on higher level dimensions

with 5 comments

Answering questions asked on the site.

Denis Kuzmenok asks:

I need some help with a query I'm fighting with.

I have these tables:

Product
id parent name
Site
id region name
Price
id product site value

For the products with a certain product.parent, I need to select minimal and maximal price within the given site.region, and return first 10 products ordered by the minimal price.

Each parent has 100 to 10,000 products, and there are 10,000,000 records in price table.

We have a classical OLAP task here: a fact table (price) and two dimension tables (product and site).

The task would be almost trivial if we were given the exact values of product and site. This way, we could build a composite index on these fields and value, filter on the exact values and get the first 10 values from the index.

However, we are given not the values of the dimensions themselves but those of the higher levels (parent for product and region for site).

Since the values of the levels are not stored in the facts table, we cannot index them. We need to join the dimension tables and filter on them.

This article describes four ways to join the tables, their efficiency varying depending on the density of the dimension values.

Since the only algorithm to make the joins MySQL is capable of is nested loops, basically, we need to define the order in which the tables would be joined.

There are three tables and, hence, 3! = 6 permutations that define all possible join orders:

product site price
product price site
site price product
site product price
price product site
price site product

However, two dimension tables are completely independent on each other. This means that if they come one after another in the join their order does not actually matter: they both will be searched for independent values. This reduces the number of actual combinations:

product/site price
product price site
site price product
price product/site

The joins must be designed so that the tables with most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria. The more values from the table satisfy the search criteria, the later should the table come in the join, so that by the time the join occurs, most values would have been already sifted out.

Now, let's build the queries for all types of the join. To do this, we will create sample tables:
Read the rest of this entry »

Written by Quassnoi

September 30th, 2010 at 11:00 pm

Posted in MySQL

20 latest unique records

with 3 comments

From Stack Overflow:

I have a logfile which logs the insert/delete/updates from all kinds of tables.

I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (datetime DESC)

A common solution for such a task would be writing an aggregate query with ORDER BY and LIMIT:

SELECT  person, MAX(ts) AS last_update
FROM    logfile
GROUP BY
        person
ORDER BY
        last_update DESC
LIMIT 20

What's bad in this solution? Performance, as usual.

Since last_update is an aggregate, it cannot be indexed. And ORDER BY on unindexed fields results in our good old friend, filesort.

Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on (person, ts), MySQL will tend to use a loose index scan on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), loose index scan can even degrade performance and generally will be avoided by MySQL.

We should use another approach here. Let's create a sample table and test this approach:
Read the rest of this entry »

Written by Quassnoi

August 24th, 2010 at 11:00 pm

Posted in MySQL

Indexing for ORDER BY / LIMIT

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Frode Underhill asks:

I have some applications that are logging to a MySQL database table.

The table is pretty standard on the form:

time
BIGINT(20)
source
TINYINT(4)
severity
ENUM
text
VARCHAR(255)

, where source identifies the system that generated the log entry.

There are very many entries in the table (>100 million), of which 99.9999% are debug or info messages.

I'm making an interface for browsing this log, which means I'll be doing queries like

SELECT  *
FROM    log 
WHERE   source = 2
        AND severity IN (1,2) 
        AND time > 12345 
ORDER BY
        time ASC
LIMIT 30

, if I want to find debug or info log entries from a certain point in time, or

SELECT  *
FROM    log 
WHERE   source = 2
        AND severity IN (1,2) 
        AND time < 12345 
ORDER BY
        time DESC
LIMIT 30

for finding entries right before a certain time.

How would one go about indexing & querying such a table?

I thought I had it figured out (I pretty much just tried every different combination of columns in an index), but there's always some set of parameters that results in a really slow query.

The problem is that you cannot use a single index both for filtering and ordering if you have a ranged condition (severity IN (1, 2) in this case).

Recently I wrote an article with a proposal to improve SQL optimizer to handle these conditions. If a range has low cardinality (this is, there are few values that con possibly satisfy the range), then the query could be improved by rewriting the range as a series of individual queries, each one using one of the values constituting the range in an equijoin:

No optimizers can handle this condition automatically yet, so we'll need to emulate it.

Since the severity field is defined as an enum with only 5 values possible, any range condition on this field can be satisfied by no more than 5 distinct values, thus making this table ideal for rewriting the query.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

June 30th, 2010 at 11:00 pm

Posted in MySQL