Archive for May, 2009
Finding loops in a tree
From Stack Overflow:
I have a view which matches up the needs and the wants, and discards the requests that cannot be fulfilled by any user, that looks like this:
Item  Need    Have
----  ------  ------
1     Bob     George
2     George  Herbie
3     Herbie  Bob
4     Larry   Wally
5     Wally   Larry
6     John    George

I am now trying to write a query where I can calculate the number of permutations of each user having their requested needs fulfilled.
For instance, in the example above, Bob, George, and Herbie can together fulfill each other's needs, as can Larry and Wally, but John cannot, so the total count would be 2.
This task is actually about finding loops.
Whenever there is a set of users that fulfill each other's needs, there is a loop in the tree.
We can assume that either of the columns (need or have) is unique, so there is always a single path in the corresponding direction which ends in at most one loop.
It's not stated in the task, but otherwise this would be an equivalent of the Clique problem, which is NP-complete and hardly solvable in SQL.
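Before going into the SQL, the loop-counting logic itself can be sketched in a few lines. Here is a minimal illustration in Python, using sqlite3 only to hold the sample data from the question (the article itself does this in pure SQL; the table name matches is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (item INTEGER PRIMARY KEY, need TEXT, have TEXT);
    INSERT INTO matches VALUES
        (1, 'Bob',    'George'),
        (2, 'George', 'Herbie'),
        (3, 'Herbie', 'Bob'),
        (4, 'Larry',  'Wally'),
        (5, 'Wally',  'Larry'),
        (6, 'John',   'George');
""")

# Since "need" is unique, each user has at most one outgoing need -> have
# edge, so every chain either dead-ends or closes into a simple cycle.
edges = dict(conn.execute("SELECT need, have FROM matches"))

seen, loops = set(), 0
for start in edges:
    if start in seen:
        continue
    path, user = [], start
    while user in edges and user not in seen and user not in path:
        path.append(user)
        user = edges[user]
    if user in path:        # walked back onto the current path: found a loop
        loops += 1
    seen.update(path)

print(loops)  # 2 (Bob-George-Herbie and Larry-Wally)
```

John's chain runs into the already-counted Bob-George-Herbie loop and is correctly not counted again.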
Now, let's create a sample table:
Read the rest of this entry »
Ranging random values
Comments enabled. I *really* need your comment
From Stack Overflow:
Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?
The ranges should not overlap and ideally have a similar amount of values in them.
This is a good job for SQL Server's analytic function NTILE.
This function, given a parameter N, assigns a rank from 1 to N to each value, just as in quartile (4), quintile (5), percentile (100) etc.
The lowest 1/N of the values get a rank of 1, those from 1/N to 2/N get a rank of 2, etc.
The values, though, are integer and random, and we can get gaps between adjacent ranges.
To avoid this, we need to extend the ranges a little, so that each range shares a common bound with both its upper and lower neighbors. This is best done by using the middle of the gap as the common range boundary.
Say, NTILE assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.
Then we will use 205 as a common bound for the first two ranges, and 310 as a common bound for the next two.
To do this, for each value selected, we will also fetch the previous one and the next one.
It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.
Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.
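The boundary calculation can be sketched as follows, using Python's sqlite3 as a stand-in (SQLite 3.25+ supports NTILE, LAG and LEAD; the original article targets SQL Server, and the table name t_values is assumed). The sample data reproduces the 100-200 / 210-300 / 320-400 example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_values (value INTEGER)")
conn.executemany("INSERT INTO t_values VALUES (?)",
                 [(v,) for v in (100, 150, 200, 210, 250, 300, 320, 360, 400)])

# For each value, NTILE(3) assigns the tile and LAG/LEAD fetch the neighbors.
# Within a tile, MIN(prev) is the previous value of the tile's smallest member
# (the last value of the previous tile), and MAX(next) is the next value of
# the tile's largest member, so the gap midpoints become shared bounds.
rows = conn.execute("""
    WITH ranked AS (
        SELECT value,
               NTILE(3)    OVER (ORDER BY value) AS tile,
               LAG(value)  OVER (ORDER BY value) AS prev,
               LEAD(value) OVER (ORDER BY value) AS next
        FROM t_values
    )
    SELECT tile,
           (MIN(value) + COALESCE(MIN(prev), MIN(value))) / 2 AS lower_bound,
           (MAX(value) + COALESCE(MAX(next), MAX(value))) / 2 AS upper_bound
    FROM ranked
    GROUP BY tile
    ORDER BY tile
""").fetchall()

print(rows)  # [(1, 100, 205), (2, 205, 310), (3, 310, 400)]
```

The COALESCE handles the outermost tiles, whose first prev and last next are NULL, so the overall range is bounded by the actual minimum and maximum.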
Let's create a sample table and see how it works:
Read the rest of this entry »
UNION vs UNION ALL
In my previous article, I described efficient ORDER BY along with LIMIT for UNION'ed resultsets.
One thing was left unanswered: why is UNION DISTINCT in some cases faster than UNION ALL?
UNION DISTINCT (for which a mere UNION is an alias) implies selecting DISTINCT values from the resultsets, while UNION ALL just selects all values, including duplicates.
On one hand, UNION ALL does not perform duplicate checking, which should be faster (and almost always is). On the other hand, it returns more rows.
If some complex calculations are performed on these rows, this may kill all performance benefits.
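The row-count difference that drives this trade-off is easy to see; here is a minimal sketch using Python's sqlite3 (the table names and data are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (val INTEGER);
    CREATE TABLE t2 (val INTEGER);
""")
# Heavily duplicated data: 1000 rows per table, only 10 distinct values.
conn.executemany("INSERT INTO t1 VALUES (?)", [(i % 10,) for i in range(1000)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(i % 10,) for i in range(1000)])

all_rows = conn.execute(
    "SELECT val FROM t1 UNION ALL SELECT val FROM t2").fetchall()
distinct_rows = conn.execute(
    "SELECT val FROM t1 UNION SELECT val FROM t2").fetchall()

print(len(all_rows), len(distinct_rows))  # 2000 10
```

If each of those rows then feeds an expensive expression or join, the 200x smaller UNION DISTINCT resultset can easily outweigh the cost of the duplicate check itself.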
Let's create the sample tables:
Read the rest of this entry »
Ordering UNION’s
From Stack Overflow:
When I optimize my 2 single queries to run in less than 0.02 seconds and then UNION them, the resulting query takes over 1 second to run.

Also, a UNION ALL takes longer than a UNION DISTINCT. I would assume allowing duplicates would make the query run faster and not slower.

Am I really just better off running the 2 queries separately? I would prefer to use the UNION.
This kind of thing happens when the queries employ ORDER BY along with LIMIT.
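One common rewrite is to apply ORDER BY and LIMIT to each part separately, then merge the short results and limit once more: the final top-N rows can only come from the per-part top-N. A sketch of the idea using Python's sqlite3 (the question concerns MySQL; table names and data are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, val INTEGER);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, val INTEGER);
""")
# Two tables, each holding the values 0..999 in scrambled order.
conn.executemany("INSERT INTO t1 (val) VALUES (?)",
                 [((i * 37) % 1000,) for i in range(1000)])
conn.executemany("INSERT INTO t2 (val) VALUES (?)",
                 [((i * 53) % 1000,) for i in range(1000)])

# Naive form: the ORDER BY / LIMIT applies to the whole 2000-row union.
naive = conn.execute("""
    SELECT val FROM t1
    UNION ALL
    SELECT val FROM t2
    ORDER BY val LIMIT 5
""").fetchall()

# Rewritten form: each part is limited first, then only 10 rows are merged.
rewritten = conn.execute("""
    SELECT val FROM (SELECT val FROM t1 ORDER BY val LIMIT 5)
    UNION ALL
    SELECT val FROM (SELECT val FROM t2 ORDER BY val LIMIT 5)
    ORDER BY val LIMIT 5
""").fetchall()

print(naive == rewritten)  # True
```

On engines with suitable indexes on val, each inner query can stop after 5 index rows, which is what keeps the separate queries fast in the first place.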
Let's create the sample tables and see what is happening:
Read the rest of this entry »
PostgreSQL: selecting a function
From Stack Overflow:
Hello,
I want to write a SELECT statement as follows:

SELECT field_a FROM my_table WHERE field_b IN (my_function(field_c))

Is that possible? Would my_function have to return an array?
It is of course possible in PostgreSQL using a set returning function.
The return type of such a function should be declared as SETOF a given type, so the function returns a rowset of that type.
There are at least two ways to call this function, and in this article I will consider the benefits and drawbacks of each method.
Let's create sample tables:
Read the rest of this entry »
Oracle: returning deleted records
From Stack Overflow:
I need to return a rowset from an Oracle procedure, and then delete them in that same procedure.
Is there a neat way of doing this without temp tables?
Something like an in-memory cursor maybe?
Oracle has a nice feature of returning cursor variables from a stored procedure or an anonymous block.
Like any cursor, the returned cursor is a handle to a stable rowset.
This means that it is not affected by any changes to the database made after the cursor was opened.
We can exploit this feature to achieve the required behavior.
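Oracle's ref cursors cannot be shown in SQLite, but the shape of the trick can: fetch the rowset and delete it within one transaction, then hand the already-fetched rows back to the caller. A hedged analogue in Python's sqlite3 (the table and function names are invented for illustration, and this materializes the rows in the client rather than keeping a server-side cursor open):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO outbox (payload) VALUES (?)",
                 [("msg-%d" % i,) for i in range(5)])
conn.commit()

def pop_all(conn):
    """Return all rows and delete them in the same transaction."""
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT id, payload FROM outbox ORDER BY id").fetchall()
        if rows:
            # Delete only up to the last fetched id, so rows inserted
            # after the SELECT are left for the next call.
            conn.execute("DELETE FROM outbox WHERE id <= ?", (rows[-1][0],))
        return rows

rows = pop_all(conn)
print(len(rows))  # 5
print(conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0])  # 0
```

In Oracle, the OPEN ... FOR / DELETE sequence inside the procedure achieves the same effect without materializing anything, which is exactly the feature the article explores.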
Let's create a sample table:
Read the rest of this entry »
Indexing a link table
From Stack Overflow:
A relation table is the common solution to representing a many-to-many (m:n) relationship. In its simplest form, it combines foreign keys referencing the two related tables into a new composite primary key:

CREATE TABLE t_a (
    id INT NOT NULL PRIMARY KEY,
    value VARCHAR(50) NOT NULL
)

CREATE TABLE t_b (
    id INT NOT NULL PRIMARY KEY,
    value VARCHAR(50) NOT NULL
)

CREATE TABLE t_ab (
    a INT NOT NULL,
    b INT NOT NULL
)

How should it be indexed to provide optimal performance in every JOIN situation?
I'll consider SQL Server as the target platform this time.
We need all WHERE, ORDER BY and GROUP BY conditions to be covered by the indexes.
At first thought, we need to have indexes on all columns in all orders:
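As a rough illustration of why a pair of mirrored composite indexes covers both join directions, here is a sketch using Python's sqlite3 (SQLite's planner, not SQL Server's, so the plans are only indicative; the mirror index name is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_ab (a INT NOT NULL, b INT NOT NULL, PRIMARY KEY (a, b));
    -- mirror index for lookups that start from b
    CREATE UNIQUE INDEX ix_ab_ba ON t_ab (b, a);
""")
conn.executemany("INSERT INTO t_ab VALUES (?, ?)",
                 [(i % 10, i // 10) for i in range(100)])

# Both access directions are satisfied entirely from an index,
# with no lookup into the table itself.
plan_ab = " ".join(r[-1] for r in conn.execute(
    "EXPLAIN QUERY PLAN SELECT b FROM t_ab WHERE a = 3"))
plan_ba = " ".join(r[-1] for r in conn.execute(
    "EXPLAIN QUERY PLAN SELECT a FROM t_ab WHERE b = 3"))

print("COVERING INDEX" in plan_ab, "COVERING INDEX" in plan_ba)
```

Since the link table has only the two key columns, each of the two indexes is covering for its direction, which is the property the article examines on SQL Server.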
Read the rest of this entry »
PostgreSQL: emulating ROW_NUMBER
Note: this article concerns PostgreSQL 8.3 and below.
PostgreSQL 8.4 introduces window functions.
Window function ROW_NUMBER()
implements the functionality in question more efficiently.
In one of the previous articles, I described emulating Oracle's pseudocolumn ROWNUM in PostgreSQL.
Now, we'll extend this query to emulate ROW_NUMBER.
A quick reminder: ROW_NUMBER is an analytic function in ANSI SQL 2003, supported by Oracle and MS SQL Server.
It enumerates each row in a resultset but, unlike ROWNUM, may take two additional clauses: PARTITION BY and ORDER BY.
PARTITION BY splits a rowset into several partitions, each of them numbered with its own sequence starting from 1.
ORDER BY defines the order in which the rows are numbered within each partition. This order may differ from the order the rows are returned in.
This function helps build queries that select the top N rows for each partition.
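The classic emulation on engines without window functions replaces ROW_NUMBER with a correlated COUNT(*): the row number is one plus the number of rows in the same partition that sort before the current row. A minimal sketch using Python's sqlite3 (the article develops the PostgreSQL version; the table and column names here are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO t_orders (customer, amount) VALUES (?, ?)", [
    ("alice", 50), ("alice", 30), ("alice", 70),
    ("bob",   20), ("bob",   90),
])

# Emulates ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount);
# the id comparison breaks ties so the numbering is deterministic.
rows = conn.execute("""
    SELECT customer, amount,
           (SELECT COUNT(*)
              FROM t_orders ti
             WHERE ti.customer = t.customer
               AND (ti.amount < t.amount
                    OR (ti.amount = t.amount AND ti.id < t.id))) + 1 AS rn
    FROM t_orders t
    ORDER BY customer, rn
""").fetchall()

print(rows)
# [('alice', 30, 1), ('alice', 50, 2), ('alice', 70, 3),
#  ('bob', 20, 1), ('bob', 90, 2)]
```

Wrapping this query and filtering on rn <= N then yields the top N rows per partition, albeit at O(n^2) cost per partition, which is why the article pursues a more efficient emulation.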
Let's create a sample table and see how we do it in PostgreSQL:
Read the rest of this entry »
Creating indexes
Answering the questions asked on the site.
Dima asks:
I have a single rather large table with 18 columns. The data in the table are continuous time-stamped records.

On one hand, the application is appending new records to this table; on the other hand, users can run various queries against it.

My question is: should I use composite indexes, a few single-column indexes, or a mix of the two? There could be all sorts of combinations in the WHERE clause, and I don't want to limit users in their choices. Is there any common-sense strategy for deciding on indexes in my case?
Unfortunately you forgot to mention what RDBMS you are using, so I'll assume MySQL. The principles are all the same.
Generally speaking, an index on (column1, column2, column3)
can be used for any kind of predicate that can be represented in the following form:
ROW(@value1_start, @value2_start, @value3_start) < ROW(column1, column2, column3) < ROW(@value1_end, @value2_end, @value3_end)
In other words, if all the records are sorted by column1, column2, column3, the condition should select a contiguous block of such a sorted set of records.
Any of the @value bounds here may coincide with its counterpart (giving an equality condition) or be +Infinity or -Infinity (leaving the range open on that side).
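The "contiguous block" criterion can be checked mechanically. A small pure-Python sketch (the column values are made up for illustration):

```python
from itertools import product

# All (column1, column2, column3) combinations, sorted as the index would be.
records = sorted(product(range(3), range(3), range(3)))

def matches(rec):
    # column1 = 1 AND column2 BETWEEN 0 AND 1, i.e.
    # ROW(1, 0, -inf) <= ROW(c1, c2, c3) <= ROW(1, 1, +inf)
    return rec[0] == 1 and 0 <= rec[1] <= 1

hits = [i for i, rec in enumerate(records) if matches(rec)]
contiguous = hits == list(range(hits[0], hits[-1] + 1))
print(contiguous)  # True: one contiguous block -> sargable for this index

def matches_c2(rec):
    return rec[1] == 1  # condition on column2 alone, column1 unconstrained

hits2 = [i for i, rec in enumerate(records) if matches_c2(rec)]
contig2 = hits2 == list(range(hits2[0], hits2[-1] + 1))
print(contig2)  # False: scattered blocks -> not sargable for this index
```

The second predicate would need a separate index leading with column2 (or an index skip scan, on engines that support one) to be served efficiently.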
Let's look at some real-world predicates and decide whether they are sargable (i.e. whether an index search can be applied to them). We assume that all columns are INTEGER:
Read the rest of this entry »
Selecting first letters
From Stack Overflow:
I would like to produce a character list of all of the first letters of column in my database.
Is there a way to do this in MySQL?
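The straightforward version is a DISTINCT over the first character; a minimal sketch using Python's sqlite3 with random sample data (the table name is an assumption, and the point of the article is making this fast on a large table rather than this naive scan):

```python
import random
import sqlite3

random.seed(42)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_names (name TEXT)")
# Random 8-letter names drawn from a small alphabet.
conn.executemany("INSERT INTO t_names VALUES (?)",
                 [("".join(random.choice("abcdefgh") for _ in range(8)),)
                  for _ in range(10000)])

letters = [r[0] for r in conn.execute(
    "SELECT DISTINCT SUBSTR(name, 1, 1) FROM t_names ORDER BY 1")]
print(letters)  # every first letter present in the data, in order
```

The catch is that this scans every row; with an index on name, an engine can instead jump from one initial letter to the next, which is the technique the article builds.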
Let's create a sample table of 1,000,000 records and fill it with random data:
Read the rest of this entry »