Archive for June, 2009
Ordering NULLs
Comments enabled. I *really* need your comment
From Stack Overflow:
In our product we have a generic search engine, and trying to optimize the search performance.
A lot of the tables used in the queries allow
NULL
values.Should we redesign our table to disallow
NULL
values for optimization or not?
NULL
values surely affect some optimizer decisions in Oracle.
It's widely known that Oracle won't include a column into index if all key values are NULL
.
This means that if we don't apply a filter which would imply non-NULL
condition on the column (like, any comparison operator in the WHERE
clause, or a comparison operation in the ON
condition of a JOIN
clause), the index is not guaranteed to return all values, and a FULL SCAN
will be required on the table.
It is less evident that ORDER BY
queries can be affected too.
Let's create a sample table and see how:
Read the rest of this entry »
Counting bans
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a MyISAM table in MySQL 5.0 like this:
id login ip banned I would like to find all users not banned (
banned = 0
) if at least 5 other users with the sameip
have already been banned (banned = 1
).
To do this, we should first find which IP
's have more than 5 bans and then find all non-banned users for these IP
's.
To find the IP
's we can use two methods:
- Find all banned
IP
's, group them and filter out those that haveCOUNT(*) < 5
- For each
IP
possible, find if there are 5 banned records for thisIP
using aLIMIT
clause
The first method doesn't require doing an extra JOIN
on the table, but it will do count all values just to figure out if the COUNT(*)
is more or less than 5.
The second method stops counting whenever it find the 5th value, but it requires an extra JOIN
.
Counting is bad if there are many values to count, and the extra join is bad if there are many JOIN
loops.
This means that the first method should be faster if there are lots of banned records, the, and the second method is faster for few banned records and few distinct IP
's.
Let's create two sample tables and see:
Read the rest of this entry »
Efficient EXISTS
Comments enabled. I *really* need your comment
In my previous article:
I compared efficiency of different methods to check for existence of a value in a subquery resultset.
These methods included IN
and EXISTS
predicates.
It was shown that in SQL Server, a poorly written EXISTS
can be very slow, while an IN
predicate is almost always efficiently optimized using one of the numerous SEMI JOIN
methods.
However, a properly written EXISTS
predicate can outperform IN
.
Let's create a sample table and see how:
Read the rest of this entry »
IN vs. JOIN vs. EXISTS
From Stack Overflow:
I'm wanting to select rows in a table where the primary key is in another table.
I'm not sure if I should use a
JOIN
or theIN
operator in SQL Server 2005:SELECT * FROM a WHERE a.c IN ( SELECT d FROM b )SELECT a.* FROM a JOIN b ON a.c = b.dIs there any significant performance difference between these two SQL queries with a large dataset (i. e. millions of rows)?
Before we begin I should note that the queries provided by the author are not similar in fact and could produce different resultsets.
If the table used in the subquery returns a value twice, a JOIN
will also return the matching rows twice, while an IN
condition will return them only once.
That's why it's more correct to compare to the following query:
SELECT a.* FROM a JOIN ( SELECT DISTINCT d FROM b ) bo ON a.c = bo.d
Now, à nos moutons.
The folk wisdom advises against using IN
constructs in favor of JOIN
's because they say that the subquery will be executed many times in a loop
.
Let's create some sample tables and see if it's really so:
Read the rest of this entry »
Group based identity
Comments enabled. I *really* need your comment
From Stack Overflow:
I'm working with an application that adds a new row to the database, based on the last row meeting a certain criteria.
Here is an over-simplified visualization:
A1 A2 A3 B1 B2 Using the visualization above, a web page loads up the highest B value, which is 2.
Then, after some time, it wants to insert B3, the next record in the series.
However, it has to check to make sure that someone else didn't do the same thing.
In actuality, each row corresponds to a place that a piece of equipment visited in a factory.
The page (not easy to rewrite), currently moves the equipment to the next step.
Is there a standard pattern for dealing with this type of problem, or do I simply need to lock the table?
This question combines two other frequently asked questions:
- How do I make group-based identity (i. e. keep several identity series identified by the value of a certain column)
- How do I ensure there will be no identity gaps?
While there are numerious solutions that involve some kind of selecting MAX(column)
and updating the row inside a transaction, there is a more simple way using linked lists.
We need to create the table as following:
Read the rest of this entry »
Flattening timespans: PostgreSQL
Comments enabled. I *really* need your comment
In previous three articles:
, I described various approaches to merging a set of intersecting timespans into a smaller set of larger, non-intersecting ranges.
Today, I'll describe how to do it in PostgreSQL.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
PostgreSQL implements various procedural languages, like pl/PGSQL, pl/perl etc., which can be used to solve almost any task using procedural approaches.
Using, for instance, pl/PGSQL, we can implement an approach similar to one we used for SQL Server.
We should create a set-returning function, open a cursor inside it and select all timespans ordered by start
, keeping track of a running maximum of stop
.
Whenever we find a timespan that starts later than end all previously started timespans, we know we have a gap.
Here's the sample table:
Read the rest of this entry »
Flattening timespans: MySQL
Comments enabled. I *really* need your comment
In the previous articles:
, I described how to merge intersecting timespans into several contiguous ranges.
In SQL Server, this task required a table-valued function (TVF) and a cursor.
In Oracle, we had to employ window functions to do this.
Now let's see how to solve this task in MySQL.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
MySQL offers easy use of session variables. These are loosely typed variables defined within the session scope which you can assign and evaluate right inside the query.
Using these variables may come in handy for tasks like this.
In MySQL, just like in previous solutions, we will need to order the timespans by start
anc calculate the running maximum of the stop
.
This is easily done using session variable @edate
which we will use to store running maximum. We will update in in the SELECT
clause.
We will also declare another variable @r
, initialize it with a zero and use it to keep a range number
.
If we have a timespan whose start
exceeds the previous value of the running maximum then all timespans that had started eariler had also been stopped by this time, and we have a gap.
The current timespan, therefore, belongs to another range, and we need to increment the value of @r
.
Finally, we will just group the timespans on the range number and return MIN(start)
and MAX(stop)
for all timespans in a range. Easy to see that these will be the start
and the stop
for the whole range.
Here is the sample table:
Read the rest of this entry »
Flattening timespans: Oracle
Comments enabled. I *really* need your comment
In my previous article:
I described a function that merges intersecting timespanss into one contiguous range in SQL Server.
This query turned out to be a good illustration for window functions, so now we'll try to repeat this query in Oracle which is superb in terms of window functions support.
A quick reminder of the problem, taken from Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
As you may remember, we had to use the cursors in the SQL Server decision, since SQL Server lacks a way to calculate a running maximum.
But Oracle does have such a way, so we don't need cursors and functions here, and everything can be done in a single query.
The principle remains the same and is just changed a little to fit into one query:
- Assign a
ROW_NUMBER()
to each row (since the rows lack aPRIMARY KEY
, we will need it later to use as one). - Order the rows by
start
and calculate running maximum for the stop. - Pick all rows that have a
start
greater than the running maximum of the previous row. Having such a timespan means that all timespans evaluated before had started earlier than the given one (since they are ordered bystart
) and also had ended earlier than the given one have started (since its start exceeds the running maximum). Therefore, we have a gap. - Pick the
ROW_NUMBER()
of each gap row along with theROW_NUMBER()
of the record that comes right before the gap row next to the given one. - Use these
ROW_NUMBER()
's to find out thestart
and thestop
of the corresponding records. These will be the timespan bounds.
Let's create a sample table and compose the query:
Read the rest of this entry »
Flattening timespans: SQL Server
Comments enabled. I *really* need your comment
From Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
This is also quite a common task.
The algorithm here is simple:
- Order all timespans by
start
- Initialize a variable which will hold a running maximum of the timespans'
stop
- Take the timespans one by one, and for each next timespan, check its
start
against a running maximum of the previousstop
's. If the currentstart
is greater than the maximum of previousstop
's, it means that all timespans that started earlier have been ended, and we have a gap - Update the running maximum
Unfortunately, SQL Server
does not provide a simple way to calculate running maximums. We could use a correlated subquery of course, but it would take forever.
This is one of quite rare cases when cursor in SQL Server
work faster than set-based operations.
Let's create a sample table and see how it works:
Read the rest of this entry »
Joining unjoinable
Comments enabled. I *really* need your comment
From Stack Overflow:
I want to combine two tables into one.
Let say I have:
Table1:
ID Name 1 A 2 B 3 C Table2:
ID Name 4 D 5 E 6 F I want to make Table3:
Name1 Name2 A D B E C F How can I do this in SQL Server?
This a certainly a task for a JOIN
, since we have two tables here and want to match any row from the first table with another row from the second table.
But what condition do we use for a JOIN
?
The ID
's of the rows seem to be arbitrary and have no special meaning. They could as well be sparse and randomly distributed, so it's impossible to build a formula which corresponds one ID
to another. Same is true for the names.
However, there actually is a condition which we could JOIN
on.
Since the author mentioned there is same number of rows in both tables, we could use the ROW_NUMBER()
as a JOIN
condition.
We just need to wrap both these queries into CTE's which select a ROW_NUMBER()
along with the table data:
WITH table1 AS ( SELECT 1 AS id, 'A' AS name UNION ALL SELECT 2 AS id, 'B' AS name UNION ALL SELECT 3 AS id, 'C' AS name ), table2 AS ( SELECT 4 AS id, 'D' AS name UNION ALL SELECT 5 AS id, 'E' AS name UNION ALL SELECT 6 AS id, 'F' AS name ), q1 AS ( SELECT table1.*, ROW_NUMBER() OVER (ORDER BY name) AS rn FROM table1 ), q2 AS ( SELECT table2.*, ROW_NUMBER() OVER (ORDER BY name) AS rn FROM table2 ) SELECT q1.name, q2.name FROM q1 JOIN q2 ON q2.rn = q1.rn
name | name | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | D | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
B | E | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
C | F | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0002s (0.0264s) |
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 4 ms.