EXPLAIN EXTENDED

How to create fast database queries

What’s UNPIVOT good for?

with 5 comments

Answering questions asked on the site.

Karen asks:

… I've always thought PIVOT and UNPIVOT are signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?

I've made an actual use for them in a project I've been working on for the last several months (which is partly why there were no updates for so long!)

Part of the project is a task management system where each task has several persons related to it. There can be the creator of the task, the person the task is assigned to, the actual author of the task (on behalf of whom the task is created), and the task can be possible completed by a person or deleted by a person. A total of 5 fields related to persons.

Now, we need to take all tasks within a certain time range and list all people involved in them.

Let's create a sample table and see how would we do that.

Read the rest of this entry »

Written by Quassnoi

June 30th, 2011 at 11:00 pm

Posted in SQL Server

Shared Plan and Algorithm Network Cache (SPANC)

with 5 comments

Due to the nature of my work I have to deal with various database systems.

While SQL is more or less standardized, the optimizers are implemented differently in the different systems. Some systems cannot join tables with anything other than nested loops, the other can only GROUP BY using a sort, etc.

So when you write a join in, say, MySQL, you cannot expect it to be a sort merge join (and you should consider this fact when designing the query). Or, when you write a DISTINCT in SQL Server, you can't expect a loose index scan. These are limitations put by their optimizers.

However, in the last three months I noticed a great improvement in the queries where I could not expect any.

It started when I tried to debug this in SQL Server:

SELECT  DISTINCT order
FROM    orderItem

while yielded this plan:

Similar results were obtained on Oracle:

Plan hash value: 1345318323

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                                         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                                  |             |       |   200 |     2  (50)| 00:00:01 |
|   1 |  REMOTE SPANC QUERY (SQLSERVER, MYSQL, POSTGRESQL, STACKOVERFLOW) |             |       |   200 |     2  (50)| 00:00:01 |
---------------------------------------------------------------------------------------------------------------------------------

, MySQL:

+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
| id | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows    | Extra                                               |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
|  1 | SIMPLE      | orderItem | spanc | NULL          | ALL     | NULL    | NULL |         | Using Oracle, PostgreSQL, SQL Server, StackOverflow |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+

and PostgreSQL:

Seq Scan on OrderItem  (cost=0.00..6.44 width=4)
 -> Remote Scan on SPANC (Oracle, MySQL, SQL Server, StackOverflow)   (cost=0.00..100.00 width=4)

Network analysis has shown weird encrypted activity between the servers in my internal network which host SQL Server, Oracle, PostgreSQL and MySQL servers.

Ultimately, there was unencrypted activity outside of the internal network which turned out to be an HTTP POST request followed by several GET polls to http://stackoverflow.com/questions/5518080/distinct-optimization.

It seems that the developers of major database systems agreed to share the knowledge about the most efficient query plans in some kind of a distributed storage (which probably is called SPANC as we can see in the query plans) and provide an interface to access each other's systems.

It also seems that these systems treat Stack Overflow as an external optimization engine where the most experienced developers can build their plans for them in a most efficient way.

I would be glad to have further clarification from the companies staff.

This also begs a question: how many of regular Stack Overflow participants are in fact query engines disguised as curious fellow developers?

It would be definitely nice to know.

Written by Quassnoi

April 1st, 2011 at 11:00 pm

Posted in Miscellaneous

MySQL: GROUP BY in UNION

with 2 comments

From Stack Overflow:

I have a query where I have a custom developed UDF that is used to calculate whether or not certain points are within a polygon (first query in UNION) or circular (second query in UNION) shape.

SELECT  a.geo_boundary_id, …
FROM     geo_boundary_vertex a, …
…
GROUP BY
a.geo_boundary_id
UNION
SELECT  b.geo_boundary_id, …
FROM     geo_boundary b, …
…
GROUP BY
b.geo_boundary_id

When I run an explain for the query I get filesort for both queries within the UNION.

Now, I can split the queries up and use the ORDER BY NULL trick to get rid of the filesort however when I attempt to add that to the end of a UNION it doesn't work.

How do I get rid of the filesort?

In MySQL, GROUP BY also implies ORDER BY on the same set of expressions in the same order. That's why it adds an additional filesort operation to sort the resultset if it does not come out naturally sorted (say, from an index).

This is not always a desired behavior, and MySQL manual suggests adding ORDER BY NULL to the queries where sorting is not required. This can improve performance of the queries.

Let's create a sample table and see:

Read the rest of this entry »

Written by Quassnoi

March 30th, 2011 at 11:00 pm

Posted in MySQL

MySQL: splitting aggregate queries

with 2 comments

Answering questions asked on the site.

Victor asks:

I have a table which I will call sale to protect the innocent:

Sale
id product price amount date

I need to retrieve ultimate values of price, amount and date for each product:

SELECT  product,
MIN(price), MAX(price),
MIN(amount), MAX(amount),
MIN(date), MAX(date)
FROM    sale
GROUP BY
product

The query only returns about 100 records.

I have all these fields indexed (together with product), but this still produces a full table scan over 3,000,000 records.

How do I speed up this query?

To retrieve the ultimate values of the fields, MySQL would just need to make a loose index scan over each index and take the max and min value of the field for each product.

However, the optimizer won't do when multiple indexes are involved. Instead, it will revert to a full scan.

There is a workaround for it. Let's create a sample table and see them:

Read the rest of this entry »

Written by Quassnoi

March 28th, 2011 at 11:00 pm

Posted in MySQL

Things SQL needs: SERIES()

with 9 comments

Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:

id source value
1 1 10
2 1 20
3 2 15
4 2 25
5 1 45
6 3 50
7 3 35
8 1 40
9 1 10

The records are ordered by id, and within this order there are continuous series of records which share the same value of source. In the table above, the series are separated by thick lines.

We want to calculate some aggregates across each of the series: MIN, MAX, SUM, AVG, whatever:

source min max sum avg
1 10 20 30 15.00
2 15 25 40 20.00
1 45 45 45 45.00
3 35 50 85 42.50
1 10 40 50 25.00

This can be used for different things. I used that for:

  • Reading sensors from a moving elevator (thus tracking its position)
  • Recording user's activity on a site
  • Tracking the primary node in a server cluster

, but almost any seasoned database developer can recall a need for such a query.

As you can see, the values of source are repeating so a mere GROUP BY won't work here.

In the systems supporting window functions there is a workaround for that:

Read the rest of this entry »

Written by Quassnoi

February 18th, 2011 at 11:00 pm

Posted in Miscellaneous

Late row lookups: InnoDB

with 8 comments

Answering questions asked on the site.

Aryé asks:

Thanks for your article about late row lookups in MySQL.

I have two questions for you please:

  • Is this workaround specific to MyISAM engine?
  • How does PostgreSQL handle this?

The questions concerns a certain workaround for MySQL LIMIT … OFFSET queries like this:

SELECT  *
FROM    mytable
ORDER BY
id
LIMIT   10
OFFSET  10000

which can be improved using a little rewrite:

SELECT  m.*
FROM    (
SELECT  id
FROM    mytable
ORDER BY
id
LIMIT   10
OFFSET  10000
) q
JOIN    mytable m
ON      m.id = q.id
ORDER BY
m.id

For the rationale behind this improvement, please read the original article.

Now, to the questions.

The second questions is easy: PostgreSQL won't pull the fields from the table until it really needs them. If a query involving an ORDER BY along with LIMIT and OFFSET is optimized to use the index for the ORDER BY part, the table lookups won't happen for the records skipped.

Though PostgreSQL does not reflect the table lookups in the EXPLAIN output, a simple test would show us that they are done only LIMIT times, not OFFSET + LIMIT (like MySQL does).

Now, let's try to answer the first question: will this trick improve the queries against an InnoDB table?

To do this, we will create a sample table:

Read the rest of this entry »

Written by Quassnoi

February 11th, 2011 at 11:00 pm

Posted in MySQL

Happy New Year!

with 9 comments

Some say SQL is not good at graphics.

Well, they have some point. Database engines lack scanner drivers, there is no easy way to do sepia, and magic wand, let's be honest, is just poor.

However, you can make some New Year paintings with SQL.

Let's make a New Year clock showing 12 o'clock in Oracle.

Read the rest of this entry »

Written by Quassnoi

December 31st, 2010 at 11:00 pm

Posted in Miscellaneous

PostgreSQL: parametrizing a recursive CTE

with 2 comments

Answering questions asked on the site.

Jan Suchal asks:

We've started playing with PostgreSQL and recursive queries. Looking at example that does basic graph traversal from http://www.postgresql.org/docs/9.0/static/queries-with.html.

We would like to have a view called paths defined like this:

WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph

By calling

SELECT  *
FROM    paths
WHERE   path[1] = :node_id

we would get all paths from a certain node.

The problem here is with performance. When you want this to be quick you need to add a condition for the anchor part of the UNION like this:

WITH RECURSIVE
search_graph(id, path) AS
(
SELECT  id, ARRAY[id]
FROM    node
WHERE   id = :node_id
UNION ALL
SELECT  g.dest, sg.path || g.dest
FROM    search_graph sg
JOIN    graph g
ON      g.source = sg.id
AND NOT g.dest = ANY(sg.path)
)
SELECT  path
FROM    search_graph

Now it's perfectly fast, but we cannot create a view because that would only contain paths from one specific node.

Any ideas?

An often overlooked feature of PostgreSQL is its ability to create set-returning functions and use them in SELECT list.

The record will be cross-joined with the set returned by the function and the result of the join will be added to the resultset.

This is best demonstrated with generate_series, probably a most used PostgreSQL set-returning function.
Read the rest of this entry »

Written by Quassnoi

December 24th, 2010 at 11:00 pm

Posted in PostgreSQL

10 things in MySQL (that won’t work as expected)

with 44 comments

(I just discovered cracked.com)

#10. Searching for a NULL

SELECT  *
FROM    a
WHERE   a.column = NULL

In SQL, a NULL is never equal to anything, even another NULL. This query won't return anything and in fact will be thrown out by the optimizer when building the plan.

When searching for NULL values, use this instead:

SELECT  *
FROM    a
WHERE   a.column IS NULL

Read the rest of this entry »

Written by Quassnoi

November 3rd, 2010 at 11:00 pm

Posted in MySQL

Mixed ASC/DESC sorting in MySQL

with 10 comments

From Stack Overflow:

I have a table of blog posts, each with a foreign key back to it's author. There are less than 15,000 entries in this table.

This query scans over 19,000 rows (per EXPLAIN), requires a filesort (that might be regular MySQL behavior), and takes over 400ms to return 5 rows. possibly because of the complicated WHERE used to check if the item is actually published.

SELECT  blog_post.id,
blog_post.title,
blog_post.author_id,
blog_post.has_been_fact_checked,
blog_post.published_date,
blog_post.ordering,
auth_user.username,
auth_user.email
FROM    blog_post
INNER JOIN
auth_user
ON      auth_user.id = blog_post.author_id
WHERE   blog_post.is_approved = 1
AND blog_post.has_been_fact_checked = 1
AND blog_post.published_date <=  '2010-10-25 22:40:05'
ORDER BY
        blog_post.published_date DESC,
        blog_post.ordering ASC,
        blog_post.id DESC
LIMIT 5


How can I wrangle this query under control?

This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the ORDER BY perfectly, so the whole query could be executed using an index scan without any filesorts.

The only problem is that we have the mixed directions in ORDER BY, and MySQL does not support ASC / DESC clauses for the indexes.

With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the ordering column, say, by creating its copy and storing negative values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That's what many engines do, MediaWiki (which Wikipedia runs on) being one of the most well-known examples.

This solution is nice, I hear people saying, but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think.

OK, this is a good point. Let's see what we can do with the current design, and, as always, create a sample table to do it:
Read the rest of this entry »

Written by Quassnoi

November 2nd, 2010 at 11:00 pm

Posted in MySQL