Author Archive
Things SQL needs: SERIES()
Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.
I believe this could be best illustrated with an example:
id | source | value |
---|---|---|
1 | 1 | 10 |
2 | 1 | 20 |
3 | 2 | 15 |
4 | 2 | 25 |
5 | 1 | 45 |
6 | 3 | 50 |
7 | 3 | 35 |
8 | 1 | 40 |
9 | 1 | 10 |
The records are ordered by id
, and within this order there are continuous series of records which share the same value of source
. In the table above, the series are separated by thick lines.
We want to calculate some aggregates across each of the series: MIN
, MAX
, SUM
, AVG
, whatever:
source | min | max | sum | avg |
---|---|---|---|---|
1 | 10 | 20 | 30 | 15.00 |
2 | 15 | 25 | 40 | 20.00 |
1 | 45 | 45 | 45 | 45.00 |
3 | 35 | 50 | 85 | 42.50 |
1 | 10 | 40 | 50 | 25.00 |
This can be used for different things. I used that for:
- Reading sensors from a moving elevator (thus tracking its position)
- Recording user's activity on a site
- Tracking the primary node in a server cluster
, but almost any seasoned database developer can recall a need for such a query.
As you can see, the values of source
are repeating so a mere GROUP BY
won't work here.
In the systems supporting window functions there is a workaround for that:
Late row lookups: InnoDB
Answering questions asked on the site.
Aryé asks:
Thanks for your article about late row lookups in MySQL.
I have two questions for you please:
- Is this workaround specific to MyISAM engine?
- How does PostgreSQL handle this?
The questions concerns a certain workaround for MySQL LIMIT … OFFSET
queries like this:
SELECT * FROM mytable ORDER BY id LIMIT 10 OFFSET 10000
which can be improved using a little rewrite:
SELECT m.* FROM ( SELECT id FROM mytable ORDER BY id LIMIT 10 OFFSET 10000 ) q JOIN mytable m ON m.id = q.id ORDER BY m.id
For the rationale behind this improvement, please read the original article.
Now, to the questions.
The second questions is easy: PostgreSQL won't pull the fields from the table until it really needs them. If a query involving an ORDER BY
along with LIMIT
and OFFSET
is optimized to use the index for the ORDER BY
part, the table lookups won't happen for the records skipped.
Though PostgreSQL does not reflect the table lookups in the EXPLAIN
output, a simple test would show us that they are done only LIMIT
times, not OFFSET + LIMIT
(like MySQL does).
Now, let's try to answer the first question: will this trick improve the queries against an InnoDB table?
To do this, we will create a sample table:
Happy New Year: drawing clock in SQL
Some say SQL is not good at graphics.
Well, they have some point. Database engines lack scanner drivers, there is no easy way to do sepia, and magic wand, let's be honest, is just poor.
However, you can make some New Year paintings with SQL.
Let's make a New Year clock showing 12 o'clock in Oracle.
PostgreSQL: parametrizing a recursive CTE
Answering questions asked on the site.
Jan Suchal asks:
We've started playing with PostgreSQL and recursive queries. Looking at example that does basic graph traversal from http://www.postgresql.org/docs/9.0/static/queries-with.html.
We would like to have a view called
paths
defined like this:WITH RECURSIVE search_graph(id, path) AS ( SELECT id, ARRAY[id] FROM node UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT path FROM search_graphBy calling
SELECT * FROM paths WHERE path[1] = :node_idwe would get all paths from a certain node.
The problem here is with performance. When you want this to be quick you need to add a condition for the anchor part of the
UNION
like this:WITH RECURSIVE search_graph(id, path) AS ( SELECT id, ARRAY[id] FROM node WHERE id = :node_id UNION ALL SELECT g.dest, sg.path || g.dest FROM search_graph sg JOIN graph g ON g.source = sg.id AND NOT g.dest = ANY(sg.path) ) SELECT path FROM search_graphNow it's perfectly fast, but we cannot create a view because that would only contain paths from one specific node.
Any ideas?
An often overlooked feature of PostgreSQL is its ability to create set-returning functions and use them in SELECT
list.
The record will be cross-joined with the set returned by the function and the result of the join will be added to the resultset.
This is best demonstrated with generate_series
, probably a most used PostgreSQL set-returning function.
Read the rest of this entry »
10 things in MySQL (that won’t work as expected)
(I just discovered cracked.com)
#10. Searching for a NULL
SELECT * FROM a WHERE a.column = NULL
In SQL, a NULL
is never equal to anything, even another NULL
. This query won't return anything and in fact will be thrown out by the optimizer when building the plan.
When searching for NULL
values, use this instead:
SELECT * FROM a WHERE a.column IS NULL
Mixed ASC/DESC sorting in MySQL
From Stack Overflow:
I have a table of blog posts, each with a foreign key back to it's author. There are less than 15,000 entries in this table.
This query scans over 19,000 rows (per
EXPLAIN
), requires afilesort
(that might be regular MySQL behavior), and takes over 400ms to return 5 rows. possibly because of the complicatedWHERE
used to check if the item is actually published.SELECT blog_post.id, blog_post.title, blog_post.author_id, blog_post.has_been_fact_checked, blog_post.published_date, blog_post.ordering, auth_user.username, auth_user.email FROM blog_post INNER JOIN auth_user ON auth_user.id = blog_post.author_id WHERE blog_post.is_approved = 1 AND blog_post.has_been_fact_checked = 1 AND blog_post.published_date <= '2010-10-25 22:40:05' ORDER BY blog_post.published_date DESC, blog_post.ordering ASC, blog_post.id DESC LIMIT 5
How can I wrangle this query under control?
This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the ORDER BY
perfectly, so the whole query could be executed using an index scan without any filesorts.
The only problem is that we have the mixed directions in ORDER BY
, and MySQL does not support ASC / DESC
clauses for the indexes.
With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the ordering
column, say, by creating its copy and storing negative values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That's what many engines do, MediaWiki (which Wikipedia runs on) being one of the most well-known examples.
This solution is nice
, I hear people saying, but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think
.
OK, this is a good point. Let's see what we can do with the current design, and, as always, create a sample table to do it:
Read the rest of this entry »
Correlated LEFT JOIN in a subquery
Answering questions asked on the site.
Beren asks:
In my database, I store the information about the game matches.
A match can be played between two players (mostly) or, occasionally, between two teams of 2, 3 or 4 players.
I store the matches in a table like this:
Game id player1 player2 type , where
type
isENUM(player, team)
.If
type
isplayer
, the ids of players are stored in the record; if type isteam
, those of teams are stored.Now, the tricky part. For a given game, I need to select two lists of players from both sides, comma separated, be they single players or members of a team.
Are two separate queries required for this?
This of course can be easily done using a single query.
Let's make a sample table and see:
Read the rest of this entry »
OLAP in MySQL: four ways to filter on higher level dimensions
Answering questions asked on the site.
Denis Kuzmenok asks:
I need some help with a query I'm fighting with.
I have these tables:
Product id parent name
Site id region name
Price id product site value For the products with a certain
product.parent
, I need to select minimal and maximal price within the givensite.region
, and return first 10 products ordered by the minimal price.Each
parent
has 100 to 10,000 products, and there are 10,000,000 records inprice
table.
We have a classical OLAP task here: a fact table (price
) and two dimension tables (product
and site
).
The task would be almost trivial if we were given the exact values of product
and site
. This way, we could build a composite index on these fields and value
, filter on the exact values and get the first 10 values from the index.
However, we are given not the values of the dimensions themselves but those of the higher levels (parent
for product
and region
for site
).
Since the values of the levels are not stored in the facts table, we cannot index them. We need to join the dimension tables and filter on them.
This article describes four ways to join the tables, their efficiency varying depending on the density of the dimension values.
Since the only algorithm to make the joins MySQL is capable of is nested loops, basically, we need to define the order in which the tables would be joined.
There are three tables and, hence, 3! = 6 permutations that define all possible join orders:
product | site | price |
product | price | site |
site | price | product |
site | product | price |
price | product | site |
price | site | product |
However, two dimension tables are completely independent on each other. This means that if they come one after another in the join their order does not actually matter: they both will be searched for independent values. This reduces the number of actual combinations:
product/site | price | |
product | price | site |
site | price | product |
price | product/site |
The joins must be designed so that the tables with most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria. The more values from the table satisfy the search criteria, the later should the table come in the join, so that by the time the join occurs, most values would have been already sifted out.
Now, let's build the queries for all types of the join. To do this, we will create sample tables:
Read the rest of this entry »
20 latest unique records
From Stack Overflow:
I have a logfile which logs the insert/delete/updates from all kinds of tables.
I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (
datetime DESC
)
A common solution for such a task would be writing an aggregate query with ORDER BY
and LIMIT
:
SELECT person, MAX(ts) AS last_update FROM logfile GROUP BY person ORDER BY last_update DESC LIMIT 20
What's bad in this solution? Performance, as usual.
Since last_update
is an aggregate, it cannot be indexed. And ORDER BY
on unindexed fields results in our good old friend, filesort
.
Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on (person, ts)
, MySQL
will tend to use a loose index scan on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), loose index scan can even degrade performance and generally will be avoided by MySQL
.
We should use another approach here. Let's create a sample table and test this approach:
Read the rest of this entry »
Questions
I had tons of work the last month and got a whole bunch of questions in the queue which I haven't answered yet.
Pawel, Mark, Kate, Felix, Andrew, Ludwig, Santos, Jeremy, John, another John, Anjan and Dave — I do remember about you, guys!
Will try to resume the normal blogging schedule the next week.
Stay subscribed!