EXPLAIN EXTENDED

How to create fast database queries

Author Archive

Flattening timespans: PostgreSQL

Comments enabled. I *really* need your comment

In the three previous articles, I described various approaches to merging a set of intersecting timespans into a smaller set of larger, non-intersecting ranges.

Today, I'll describe how to do it in PostgreSQL.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiguous and need to be merged into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

PostgreSQL supports various procedural languages, such as PL/pgSQL and PL/Perl, which can be used to solve almost any task procedurally.

Using PL/pgSQL, for instance, we can implement an approach similar to the one we used for SQL Server.

We should create a set-returning function, open a cursor inside it, and select all timespans ordered by start, keeping track of a running maximum of stop.

Whenever we find a timespan that starts later than the end of every previously started timespan, we know we have a gap.
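That logic can be sketched as a PL/pgSQL set-returning function. This is a minimal sketch under assumed names, not the article's exact code: the table t_span and its columns d_start and d_stop are hypothetical, and the FOR loop stands in for the explicit cursor (PL/pgSQL opens an implicit one for it).

```sql
-- Hypothetical table: t_span (d_start TIMESTAMP, d_stop TIMESTAMP)
CREATE OR REPLACE FUNCTION fn_flatten_spans()
RETURNS TABLE (r_start TIMESTAMP, r_stop TIMESTAMP) AS
$$
DECLARE
        cur_start TIMESTAMP;  -- start of the range being built
        cur_stop  TIMESTAMP;  -- running maximum of stop
        rec       RECORD;
BEGIN
        FOR rec IN
                SELECT  d_start, d_stop
                FROM    t_span
                ORDER BY
                        d_start
        LOOP
                IF cur_stop IS NULL THEN
                        -- first row: open the first range
                        cur_start := rec.d_start;
                        cur_stop  := rec.d_stop;
                ELSIF rec.d_start > cur_stop THEN
                        -- gap found: emit the finished range, open a new one
                        r_start := cur_start;
                        r_stop  := cur_stop;
                        RETURN NEXT;
                        cur_start := rec.d_start;
                        cur_stop  := rec.d_stop;
                ELSE
                        -- overlapping or adjacent: extend the running maximum
                        cur_stop := GREATEST(cur_stop, rec.d_stop);
                END IF;
        END LOOP;
        IF cur_stop IS NOT NULL THEN
                -- emit the last range
                r_start := cur_start;
                r_stop  := cur_stop;
                RETURN NEXT;
        END IF;
END;
$$
LANGUAGE plpgsql;
```

Note that a timespan starting exactly at the running maximum is treated as adjacent and merged, which matches the sample data above.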

Here's the sample table:

Written by Quassnoi

June 14th, 2009 at 11:00 pm

Posted in PostgreSQL

Flattening timespans: MySQL


In the previous articles, I described how to merge intersecting timespans into several contiguous ranges.

In SQL Server, this task required a table-valued function (TVF) and a cursor.

In Oracle, we had to employ window functions to do this.

Now let's see how to solve this task in MySQL.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiguous and need to be merged into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

MySQL offers easy use of session variables: loosely typed variables defined within the session scope, which you can assign and evaluate right inside a query.

Using these variables may come in handy for tasks like this.

In MySQL, just like in the previous solutions, we will need to order the timespans by start and calculate the running maximum of stop.

This is easily done with a session variable, @edate, which we will use to store the running maximum and update right in the SELECT clause.

We will also declare another variable, @r, initialize it with zero, and use it to keep the range number.

If we have a timespan whose start exceeds the previous value of the running maximum, then all timespans that started earlier have also stopped by this time, and we have a gap.

The current timespan, therefore, belongs to another range, and we need to increment the value of @r.

Finally, we just group the timespans on the range number and return MIN(start) and MAX(stop) for all timespans in a range. It is easy to see that these will be the start and the stop of the whole range.
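The whole approach can be sketched in a single query. The table t_span and its columns d_start and d_stop are assumed names for illustration; also note that assigning session variables inside a SELECT relies on evaluation order MySQL does not strictly guarantee, and the feature is deprecated in MySQL 8.0.

```sql
-- Hypothetical table: t_span (d_start DATETIME, d_stop DATETIME)
SELECT  MIN(d_start) AS span_start,
        MAX(d_stop)  AS span_stop
FROM    (
        SELECT  d_start,
                d_stop,
                -- open a new range whenever a gap is found
                @r := @r + (d_start > @edate) AS range_number,
                -- update the running maximum of stop
                @edate := GREATEST(@edate, d_stop) AS running_max
        FROM    (
                SELECT  d_start, d_stop
                FROM    t_span
                ORDER BY
                        d_start
                ) ordered,
                (
                SELECT  @r := 0,
                        @edate := CAST('1000-01-01' AS DATETIME)
                ) vars
        ) q
GROUP BY
        range_number
```

The derived table vars initializes both variables in the same statement, so the query works without a preceding SET; @edate starts at a date earlier than any real data, which makes the first row open range 1.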

Here is the sample table:

Written by Quassnoi

June 13th, 2009 at 11:00 pm

Posted in MySQL

Flattening timespans: Oracle


In my previous article, I described a function that merges intersecting timespans into contiguous ranges in SQL Server.

That query turned out to be a good illustration of window functions, so now we'll repeat it in Oracle, which has superb support for window functions.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiguous and need to be merged into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

As you may remember, we had to use cursors in the SQL Server solution, since SQL Server lacks a way to calculate a running maximum.

Oracle, however, does have such a way, so we need neither cursors nor functions here: everything can be done in a single query.

The principle remains the same and is just changed a little to fit into one query:

  1. Assign a ROW_NUMBER() to each row (since the rows lack a PRIMARY KEY, we will need it later to serve as one).
  2. Order the rows by start and calculate the running maximum of stop.
  3. Pick all rows whose start is greater than the running maximum of the previous row. Having such a timespan means that every timespan evaluated before it had started earlier (since they are ordered by start) and had also ended before the given one started (since its start exceeds the running maximum). Therefore, we have a gap.
  4. Pick the ROW_NUMBER() of each gap row along with the ROW_NUMBER() of the record that comes right before the next gap row.
  5. Use these ROW_NUMBER()'s to find the start and stop of the corresponding records. These will be the bounds of the merged timespans.
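Since Oracle calculates running aggregates natively, the steps above can also be condensed into a running count of gap flags. The sketch below is a variant of the idea, not necessarily the article's exact query; the table t_span and its columns d_start and d_stop are assumed names.

```sql
-- Hypothetical table: t_span (d_start DATE, d_stop DATE)
SELECT  MIN(d_start) AS span_start,
        MAX(d_stop)  AS span_stop
FROM    (
        SELECT  d_start, d_stop,
                -- ranges are numbered by counting the gaps seen so far
                SUM(gap) OVER (ORDER BY rn) AS range_number
        FROM    (
                SELECT  d_start, d_stop,
                        ROW_NUMBER() OVER (ORDER BY d_start) AS rn,
                        -- flag a gap when start exceeds the running maximum
                        -- of all previous stops (NULL on the first row,
                        -- which falls through to ELSE 0)
                        CASE
                        WHEN d_start > MAX(d_stop) OVER (
                                ORDER BY d_start
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                         AND 1 PRECEDING
                                )
                        THEN 1
                        ELSE 0
                        END AS gap
                FROM    t_span
                )
        )
GROUP BY
        range_number
```

The windowing clause `ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING` is what excludes the current row from the running maximum, so each row is compared only against the stops that came before it.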

Let's create a sample table and compose the query:

Written by Quassnoi

June 12th, 2009 at 11:00 pm

Posted in Oracle

Flattening timespans: SQL Server


From Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiguous and need to be merged into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

This is also quite a common task.

The algorithm here is simple:

  1. Order all timespans by start
  2. Initialize a variable which will hold a running maximum of the timespans' stop
  3. Take the timespans one by one, and for each timespan, check its start against the running maximum of the previous stops. If the current start is greater, it means that all timespans that started earlier have already ended, and we have a gap
  4. Update the running maximum

Unfortunately, SQL Server does not provide a simple way to calculate a running maximum. We could of course use a correlated subquery, but it would take forever.

This is one of those quite rare cases when a cursor in SQL Server works faster than set-based operations.
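The cursor-based approach might look like the following sketch: a multi-statement table-valued function over a hypothetical table t_span (d_start, d_stop). This is an illustration of the algorithm above under assumed names, not the article's exact function.

```sql
-- Hypothetical table: t_span (d_start DATETIME, d_stop DATETIME)
CREATE FUNCTION fn_flatten_spans ()
RETURNS @result TABLE (r_start DATETIME, r_stop DATETIME)
AS
BEGIN
        DECLARE @start DATETIME, @stop DATETIME
        DECLARE @cur_start DATETIME, @cur_stop DATETIME

        DECLARE cr_span CURSOR LOCAL FAST_FORWARD
        FOR
        SELECT  d_start, d_stop
        FROM    t_span
        ORDER BY
                d_start

        OPEN cr_span
        FETCH NEXT FROM cr_span INTO @start, @stop
        SET @cur_start = @start
        SET @cur_stop = @stop

        WHILE @@FETCH_STATUS = 0
        BEGIN
                IF @start > @cur_stop
                BEGIN
                        -- gap found: emit the finished range, open a new one
                        INSERT INTO @result VALUES (@cur_start, @cur_stop)
                        SET @cur_start = @start
                END
                -- update the running maximum of stop
                IF @stop > @cur_stop
                        SET @cur_stop = @stop
                FETCH NEXT FROM cr_span INTO @start, @stop
        END

        IF @cur_start IS NOT NULL
                -- emit the last range
                INSERT INTO @result VALUES (@cur_start, @cur_stop)

        CLOSE cr_span
        DEALLOCATE cr_span
        RETURN
END
```

A FAST_FORWARD cursor keeps the overhead low, which is what makes this single ordered pass beat the correlated-subquery alternative.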

Let's create a sample table and see how it works:

Written by Quassnoi

June 11th, 2009 at 11:00 pm

Posted in SQL Server

Joining unjoinable


From Stack Overflow:

I want to combine two tables into one.

Let say I have:

Table1:

ID Name
1 A
2 B
3 C

Table2:

ID Name
4 D
5 E
6 F

I want to make Table3:

Name1 Name2
A D
B E
C F

How can I do this in SQL Server?

This is certainly a task for a JOIN, since we have two tables here and want to match each row from the first table with a row from the second.

But what condition do we use for a JOIN?

The IDs of the rows seem to be arbitrary and have no special meaning. They could just as well be sparse and randomly distributed, so it's impossible to build a formula that maps one ID to another. The same is true for the names.

However, there actually is a condition which we could JOIN on.

Since the author mentioned that both tables have the same number of rows, we can use ROW_NUMBER() as a JOIN condition.

We just need to wrap both tables into CTEs that select a ROW_NUMBER() along with the table data:

WITH    table1 AS
        (
        SELECT  1 AS id, 'A' AS name
        UNION ALL
        SELECT  2 AS id, 'B' AS name
        UNION ALL
        SELECT  3 AS id, 'C' AS name
        ),
        table2 AS
        (
        SELECT  4 AS id, 'D' AS name
        UNION ALL 
        SELECT  5 AS id, 'E' AS name
        UNION ALL
        SELECT  6 AS id, 'F' AS name
        ),
        q1 AS
        (
        SELECT  table1.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
        FROM    table1
        ),
        q2 AS
        (
        SELECT  table2.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
        FROM    table2
        )
SELECT  q1.name AS name1, q2.name AS name2
FROM    q1
JOIN    q2
ON      q2.rn = q1.rn

name1 name2
A D
B E
C F
3 rows fetched in 0.0002s (0.0264s)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms. 

Written by Quassnoi

June 10th, 2009 at 11:00 pm

Posted in SQL Server

Latest DISTINCT records: efficient subquery


In yesterday's article, I described an efficient query to select the 10 latest distinct IPs from a pagehit table in MySQL.

Here's the query:

SELECT  INET_NTOA(ip), ts
FROM    t_latest2 lo
WHERE   NOT EXISTS (
        SELECT  1
        FROM    t_latest2 li
        WHERE   li.ip = lo.ip
                AND li.ts > lo.ts
        )
ORDER BY
        ts DESC
LIMIT 10

This query works all right if there are lots of distinct IPs (as there are in a real table).

However, due to a bug in MySQL, this query is not as efficient as it should be.

Let's create the sample table and see what the problem is:

Written by Quassnoi

June 9th, 2009 at 11:00 pm

Posted in MySQL

Latest DISTINCT records


From Stack Overflow:

I have a table containing pagehit data in MySQL.

How do I grab the 10 latest unique IP's from that table?

This is quite a common task and there are two approaches to solve it.

Let's create two sample tables that differ in IP distribution:

Written by Quassnoi

June 8th, 2009 at 11:00 pm

Posted in MySQL

Longest common prefix: MySQL


This is the fourth and final article of the series on how to strip all strings in a set of their longest common prefix and concatenate the results. Today, we will solve this task for MySQL.

A quick reminder of the problem (taken from Stack Overflow):

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

The MySQL solution differs a little from the previous ones.

On the one hand, MySQL is the only RDBMS of all those listed above that supplies a built-in aggregate function to concatenate strings, GROUP_CONCAT.

On the other hand, MySQL lacks a way to generate an arbitrary resultset, which we relied upon in all the solutions above.

This makes it easier to concatenate the results, but a lot harder to find the longest common prefix.

Fortunately, there is a way to do it.

Since 5.1, MySQL offers an XPath manipulation function, namely ExtractValue, that will help us to solve this task.

Unfortunately, ExtractValue is not set-generating, i.e. it cannot create an SQL resultset out of an XML nodeset. But XPath itself is quite a powerful tool, and we can implement the longest common prefix algorithm entirely in XPath.
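The concatenation half of the task is indeed the easy part; a minimal sketch over a hypothetical table t_ref (id, ref) holding the sample data:

```sql
-- Hypothetical table: t_ref (id INT, ref VARCHAR(20))
SELECT  id,
        -- concatenate the refs per id, slash-separated
        GROUP_CONCAT(ref ORDER BY ref SEPARATOR '/') AS refs
FROM    t_ref
GROUP BY
        id
```

For id 1 this yields 3536757616/3536757617/3536757618: the values are concatenated, but each still carries the common prefix. Finding that prefix and stripping it is the part that needs the XPath trick with ExtractValue.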

Written by Quassnoi

June 7th, 2009 at 11:00 pm

Posted in MySQL

Longest common prefix: Oracle


Today, in the third article of the series on how to strip all strings in a set of their longest common prefix and concatenate the results, I'll describe how to do it in Oracle.

A quick reminder of the problem (taken from Stack Overflow):

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

As I already wrote earlier, this solution needs two aggregates: the first to calculate the length of the longest common prefix, and the second to concatenate the strings.

Oracle allows creating custom aggregates, but it's usually a pain to implement them. That's why we will use Oracle's MODEL clause to solve this task.

Written by Quassnoi

June 6th, 2009 at 11:00 pm

Posted in Oracle

Longest common prefix: PostgreSQL


This is a series of articles on how to strip all strings in a set of their longest common prefix and concatenate the results. Today, I'll show how to do it in PostgreSQL.

A quick reminder of the problem (taken from Stack Overflow):

I have some data:

id ref
1 3536757616
1 3536757617
1 3536757618
2 3536757628
2 3536757629
2 3536757630

and want to get the result like this:

id result
1 3536757616/7/8
2 3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

Like with SQL Server, it is possible to do this in PostgreSQL using a single SQL query.

But since PostgreSQL offers a nice ability to create custom aggregates, I'd rather demonstrate how to solve this task using them.

In my opinion, custom aggregates fit here just perfectly.

Since PostgreSQL lacks native support for aggregate concatenation, we will need to create two custom aggregates here:
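A minimal sketch of what such a pair of aggregates could look like. The names and definitions here are assumptions for illustration, not the article's exact code:

```sql
-- First aggregate: the longest common prefix of a set of strings.
-- With no INITCOND, the state starts as NULL, so the first value
-- becomes the initial prefix.
CREATE FUNCTION lcp_state(state text, value text) RETURNS text AS
$$
SELECT  CASE
        WHEN state IS NULL THEN value
        ELSE COALESCE(
                -- longest leading substring of state that value
                -- also starts with
                (
                SELECT  MAX(substr(state, 1, len))
                FROM    generate_series(1, length(state)) AS len
                WHERE   substr(state, 1, len) = substr(value, 1, len)
                ),
                '')
        END
$$
LANGUAGE sql;

CREATE AGGREGATE lcp (text)
(
        SFUNC = lcp_state,
        STYPE = text
);

-- Second aggregate: concatenation with a slash separator
CREATE FUNCTION concat_slash_state(state text, value text) RETURNS text AS
$$
SELECT  CASE WHEN state IS NULL THEN value ELSE state || '/' || value END
$$
LANGUAGE sql;

CREATE AGGREGATE concat_slash (text)
(
        SFUNC = concat_slash_state,
        STYPE = text
);
```

With these in place, an outer query can compute lcp(ref) per id in a subquery and then use concat_slash over the refs with that prefix stripped.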

Written by Quassnoi

June 5th, 2009 at 11:00 pm

Posted in PostgreSQL