Archive for 2009
MySQL: grouping continuous ranges
From Stack Overflow:
Given the following table:
ID State Date 12 1 2009-07-16 10:00 45 2 2009-07-16 13:00 67 2 2009-07-16 14:40 77 1 2009-07-16 15:00 89 1 2009-07-16 15:30 99 1 2009-07-16 16:00 , how can I group by the field
state
, while still maintaining the borders between the state changes?I need
MIN(id)
,MIN(date)
andCOUNT(*)
for each group, and this is expected:
ID State Date Count 12 1 2009-07-16 10:00 1 45 2 2009-07-16 13:00 2 77 1 2009-07-16 15:00 3
Though a pure set-based solution exists, in MySQL, this can be easily done using session variables, which is more efficient.
Let's create a sample table:
PostgreSQL: last 12 weeks of a year
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a query like this:
SELECT EXTRACT(WEEK FROM j.updated_at) as "week", count(j.id) FROM jobs WHERE EXTRACT(YEAR FROM j.updated_at)=2009 GROUP BY EXTRACT(WEEK FROM j.updated_at) ORDER BY week, which works fine, but I only want to show the last 12 weeks.
LIMIT 12
works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.
To select last 12 rows in ascending order it's enough to select first 12 rows in descending order and resort them in a subquery:
SELECT * FROM ( SELECT EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt FROM jobs WHERE EXTRACT(year FROM updated_at) = 2008 GROUP BY week ORDER BY week DESC LIMIT 12 ) q ORDER BY week
However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.
More than that, EXTRACT(year FROM updated)
is not a sargable predicate, and all table rows (or index rows) need to be examined.
On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:
SELECT * FROM ( SELECT EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt FROM t_week WHERE EXTRACT(year FROM updated) = 2008 GROUP BY week ORDER BY week DESC LIMIT 12 ) q ORDER BY week
week | cnt | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 3584 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42 | 3467 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
43 | 3499 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
44 | 3535 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
45 | 3511 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
46 | 3621 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
47 | 3529 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
48 | 3500 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49 | 3415 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 | 3536 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
51 | 3504 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
52 | 3479 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12 rows fetched in 0.0004s (3.5843s) |
Sort (cost=58517.44..58517.47 rows=11 width=16) Sort Key: (date_part('week'::text, t_week.updated)) -> Limit (cost=58517.11..58517.14 rows=11 width=8) -> Sort (cost=58517.11..58517.14 rows=11 width=8) Sort Key: (date_part('week'::text, t_week.updated)) -> HashAggregate (cost=58516.75..58516.92 rows=11 width=8) -> Seq Scan on t_week (cost=0.00..58491.75 rows=5000 width=8) Filter: (date_part('year'::text, updated) = 2008::double precision)
However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.
Let's create a sample table and see how to do this:
Hierarchial queries in MySQL: identifying trees
Comments enabled. I *really* need your comment
Continuing on the theme of hierarchical queries in MySQL:
- Hierarchical queries in MySQL
- Hierarchical queries in MySQL: adding level
- Hierarchical queries in MySQL: adding ancestry chains.
- Hierarchical queries in MySQL: finding leaves
- Hierarchical queries in MySQL: finding loops
- Hierarchical data in MySQL: parents and children in one query
Assume we have a table with hierarchical structure like this:
treeitem | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
11 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
13 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
14 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
15 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
18 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
18 rows fetched in 0.0003s (0.0137s) |
We have two trees here: one starting from 1, another one starting from 2.
The problem is: given any item, we should identify the whole tree this item belongs to, and return the whole tree in the hierarchical order.
This also can be easily done using hierarchical queries in MySQL.
In this article: Hierarchical queries in MySQL I shown how to implement a function that returnes tree items in correct order, being called sequentially.
This function is reentrable and keeps its state in session variables, one of which, @start_with
, defines the parent element for the tree we want to build.
We have two problems here:
- Given an item, define a root of the tree it belongs to
- Build a whole tree, starting from the root
The first problem can be solved by iterating the linked list backwards, starting from the variable given as an input.
This article:
describes how to do it in great detail, that's why I'll just put a query here.
Let's create the table described above:
PostgreSQL: round-robin order
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a bunch of URLs stored in a table waiting to be scraped by a script.
However, many of those URLs are from the same site. I would like to return those URLs in a
site-friendlyorder (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many HTTP requests in a short time.The database layout is something like this:
create table urls ( site varchar, -- holds e.g. www.example.com or stockoverflow.com url varchar unique );Example result:
SELECT url FROM urls ORDER BY mysterious_round_robin_function(site)
url http://www.example.com/some/file http://stackoverflow.com/questions/ask http://use.perl.org/ http://www.example.com/some/other/file http://stackoverflow.com/tags
To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:
number | site | url |
---|---|---|
1 | example.com | http://www.example.com/some/file |
1 | stackoverflow.com | http://stackoverflow.com/questions/ask |
1 | use.perl.org | http://use.perl.org/ |
2 | example.com | http://www.example.com/some/other/file |
2 | stackoverflow.com | http://stackoverflow.com/tags |
Good news: in new PostgreSQL 8.4, we just can use a window function to do this:
SELECT url FROM urls ORDER BY ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site
url | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a.example.com/url104 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
b.example.com/url1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
c.example.com/url106 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
y.example.com/url102 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
z.example.com/url103 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
a.example.com/url130 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
b.example.com/url105 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
c.example.com/url132 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
l.example.com/url999 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
m.example.com/url974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1000 rows fetched in 0.0169s (0.0178s) |
Yes, that simple.
Bad news: in PostgreSQL 8.3 and below, we still need to hack.
Let's create a sample table and see how:
Hierarchical data in MySQL: parents and children in one query
Answering questions asked on the site.
Michael asks:
I was wondering how to implement a hierarchical query in MySQL (using the ancestry chains version) for a single row, such that it picks up the parents (if any) and any children (if any).
The idea is, I want to be able to jump in at any point, provide an Id of some sort, and be able to draw out the entire hierarchy for that Id, both upwards and downwards.
We need to combine two queries here:
- Original hierarchical query that returns all descendants of a given
id
(a descendancy chain) - A query that would return all ancestors of a given
id
(an ancestry chain)
An id
can have only one parent
, that's why we can employ a linked list technique to build an ancestry chain, like shown in this article:
Here's the query to to this (no functions required):
SELECT @r AS _id, ( SELECT @r := parent FROM t_hierarchy WHERE id = _id ) AS parent, @l := @l + 1 AS lvl FROM ( SELECT @r := 1218, @l := 0, @cl := 0 ) vars, t_hierarchy h WHERE @r <> 0
To combine two queries, we can employ a simple UNION ALL
.
The only problem that is left to preserve the correct level
, since the ancestry chain query conts level
backwards, and the hierarchical query will count it starting from zero.
Let's create a sample table and see what we get:
Read the rest of this entry »
PostgreSQL 8.4: sampling random rows
On Jul 1, 2009, PostgreSQL 8.4 was released.
In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.
Other articles on new features of PostgreSQL 8.4:
Today, I'll show a way to sample random rows from a PRIMARY KEY preserved table.
Usually, if we need, say, 10 random rows from a table, we issue this query:
SELECT * FROM t_random ORDER BY RANDOM() LIMIT 10
PostgreSQL heavily optimizes this query, since it sees a LIMIT
condition and does not sort all rows. Instead, it just keeps a running buffer which contains at most 10 rows with the least values or RANDOM
calculated so far, and when a row small enough is met, it sorts only this buffer, not the whole set.
This is quite efficient, but still requires a full table scan.
This can be a problem, since the queries like that are often run frequently on heavily loaded sites (like for showing 10 random pages on social bookmarking systems), and full table scans will hamper performance significantly.
With new PosgreSQL 8.4 abilities to run recursive queries, this can be improved.
We can sample random values of the row id
s and use an array to record previously selected values.
Let's create a sample table and see how can we imrove this query:
Read the rest of this entry »
PostgreSQL 8.4: preserving order for hierarchical query
On Jul 1, 2009, PostgreSQL 8.4 was released.
In this series of articles, I'd like to show how to reimplement some tasks I wrote about in the previous blog posts using new PostgreSQL features.
Previously in the series:
Now, let's see how we can implement the hierarchical queries using the new features of PostgreSQL 8.4.
In PostgreSQL 8.3, we had to create a recursive function to do that. If you are still bound to 8.3 or an earlier version, you can read this article to see how to do it:
In 8.4, we have recursive CTE's (common table expressions).
Let's create a sample hierarchical table and see how can we query it:
Read the rest of this entry »
INNER JOIN vs. CROSS APPLY
From Stack Overflow:
Can anyone give me a good example of when
CROSS APPLY
makes a difference in those cases whereINNER JOIN
will work as well?
This is of course SQL Server.
A quick reminder on the terms.
INNER JOIN
is the most used construct in SQL: it joins two tables together, selecting only those row combinations for which a JOIN
condition is true.
This query:
SELECT * FROM table1 JOIN table2 ON table2.b = table1.a
reads:
For each row from
table1
, select all rows fromtable2
where the value of fieldb
is equal to that of fielda
Note that this condition can be rewritten as this:
SELECT * FROM table1, table2 WHERE table2.b = table1.a
, in which case it reads as following:
Make a set of all possible combinations of rows from
table1
andtable2
and of this set select only the rows where the value of fieldb
is equal to that of fielda
These conditions are worded differently, but they yield the same result and database systems are aware of that. Usually both these queries are optimized to use the same execution plan.
The former syntax is called ANSI syntax, and it is generally considered more readable and is recommended to use.
However, it didn't make it into Oracle until recently, that's why there are many hardcore Oracle developers that are just used to the latter syntax.
Actually, it's a matter of taste.
To use JOIN
s (with whatever syntax), both sets you are joining must be self-sufficient, i. e. the sets should not depend on each other. You can query both sets without ever knowing the contents on another set.
But for some tasks the sets are not self-sufficient. For instance, let's consider the following query:
We have
table1
andtable2
.table1
has a column calledrowcount
.For each row from
table1
we need to select firstrowcount
rows fromtable2
, ordered bytable2.id
We cannot come up with a join condition here. The join condition, should it exist, would involve the row number, which is not present in table2
, and there is no way to calculate a row number only from the values of columns of any given row in table2
.
That's where the CROSS APPLY
can be used.
CROSS APPLY
is a Microsoft's extension to SQL, which was originally intended to be used with table-valued functions (TVF's).
The query above would look like this:
SELECT * FROM table1 CROSS APPLY ( SELECT TOP (table1.rowcount) * FROM table2 ORDER BY id ) t2
For each from
table1
, select firsttable1.rowcount
rows fromtable2
ordered byid
The sets here are not self-sufficient: the query uses values from table1
to define the second set, not to JOIN
with it.
The exact contents of t2
are not known until the corresponding row from table1
is selected.
I previously said that there is no way to join these two sets, which is true as long as we consider the sets as is. However, we can change the second set a little so that we get an additional computed field we can later join on.
The first option to do that is just count all preceding rows in a subquery:
SELECT * FROM table1 t1 JOIN ( SELECT t2o.*, ( SELECT COUNT(*) FROM table2 t2i WHERE t2i.id <= t2o.id ) AS rn FROM table2 t2o ) t2 ON t2.rn <= t1.rowcount
The second option is to use a window function, also available in SQL Server since version 2005:
SELECT * FROM table1 t1 JOIN ( SELECT t2o.*, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM table2 t2o ) t2 ON t2.rn <= t1.rowcount
This function returns the ordinal number a row would have, be the ORDER BY
condition used in the function applied to the whole query.
This is essentially the same result as the subquery used in the previous query.
Now, let's create the sample tables and check all these solutions for efficiency:
Oracle: OR on multiple EXISTS clauses
Comments enabled. I *really* need your comment
From Stack Overflow:
I have two queries, and I want to understand which is better in terms of performance and memory:
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_b b, tbl_a a , tbl_c c, tbl_d d WHERE ( b.id1 = a.id1 AND a.id1 = c.id1 AND UPPER(c.flag) = 'Y' AND c.id1 = d.id1 ) OR ( b.id2 = a.id2 AND a.id2 = c.id2 AND UPPER(c.flag) = 'Y' AND c.id2 = d.id2 ) AND d.id3 = 10and
SELECT DISTINCT a.no, a.id1, a.id2 FROM tbl_a a WHERE EXISTS ( SELECT a.id1, a.id2 FROM tbl_c c WHERE (a.id1 = c.id1 OR a.id2 = c.id2) AND UPPER(c.flag) = 'Y' ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_b b WHERE b.id1 = a.id1 OR b.id2 = a.id2 ) AND EXISTS ( SELECT a.id1, a.id2 FROM tbl_d d WHERE (a.id1 = d.id1 or a.id2 = d.id2) AND d.id3 = 10 )The tables
tbl_b
andtbl_d
are very large tables containing 500,000 to millions of rows, while tabletbl_a
is relatively small.My requirement is to pick up only those records from table
tbl_a
, whoseid
(eitherid1
orid2
) is available intbl_b
,tbl_c
, andtbl_d
tables, satisfying certain other conditions as well.Which is best performance-wise?
We can see that both these queries contain an OR
condition, a nightmare for most optimizers.
The first query uses a join on all four tables, concatenating the results and making a distinct set out of them.
The second query check each row in tbl_a
, making sure that the corresponding records exists in other tables in one or another way.
These queries are not identical: the first query will select the rows from tbl_a
matching all tables on same id (either three matches on id1
or three matches on id2
), while the second query returns rows matching on any id
This is, if we have a row matching tbl_b
and tbl_c
on id1
and tbl_d
on id2
, this row will be returned by the second query but not the first.
Both these queries will perform poorly on large tables. However, we can improve them.
Let's create the tables, fill them with sample data and make the improvements:
Read the rest of this entry »
Flattening timespans: PostgreSQL 8.4
On Jul 1, 2009, PostgreSQL 8.4 was released.
Among other imporvements, this version supports window functions, recursive queries and common table expressions (CTE's).
Despite being a minor release (accoring to the version numeration), this version can become quite a milestone, since these features make developer's life much, much easier.
Let's check how efficienly these features are implemented.
To do this, I'll take some tasks that I wrote about in the previous blog posts and try to reimplement them using new PostgreSQL features.
I'll start with quite a common task of flattening the intersecting timespans which I wrote about in this article:
This task requires calculating a running maximum and taking a previous record from a recordset, and therefore is a good illustration for window functions.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
Let's create a sample table:
Read the rest of this entry »