Author Archive
Genealogy query on both parents
Comments enabled. I *really* need your comment
From Stack Overflow:
I'm trying to fetch a genealogy tree of animals from my Oracle database.
Here's the table:
Animal_ID Parent_Male_ID Parent_Female_ID If I specify an animal, I can get all of its descendants (on the male side) using something like this:
SELECT * FROM animal START WITH animal_id = 123 CONNECT BY PRIOR animal_id = parent_male_idI'm trying to find a way to extend this in such a way that if I specify an animal, it will fetch both parents and then will fetch all of their descendants.
Any thoughts?
This is possible to do with a same CONNECT BY
query by providing a little more comlex condition for START WITH
and CONNECT BY
:
SELECT DISTINCT(animal_id) AS animal_id FROM animal START WITH animal_id IN ( SELECT father FROM animal WHERE animal_id = 9500 UNION ALL SELECT mother FROM animal WHERE animal_id = 9500 ) CONNECT BY PRIOR animal_id IN (father, mother) ORDER BY animal_id
However, my gut feeling says this query need performance checking.
Let's create a sample table:
Read the rest of this entry »
Ordering columns independently
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table with 5 columns in it.
What's the easiest way to select all rows, but where each column is individually randomized?
All I can think of is to select each column separately, along with
ROW_NUMBER() OVER (ORDER BY NEWID()) AS lookup
, and then join each column back together on lookup.
Is there an easier way?
Though it's possible to do this by means of JOIN
'ing, there is an easier way.
Since all SQL
operations are row-level, we will need to make 5 independent ordered sets, which we later can use to order each column set. This implies having 5 times as much rows as there are rows in the original table.
But instead of JOIN
'ing, we can use SQL Server's clause which maps rows to columns, which is called PIVOT
.
We need to do the following:
- Generate 5 sets of column values and
UNION ALL
them. - For each set, select a
ROW_NUMBER() OVER (ORDER BY NEWID())
along with each row. It will be unique for each row in any set and will give a row identifier forPIVOT
'ing. - For each set, select a set identifier: just a string with column name. This will give a column identifier for
PIVOT
'ing - Finally,
PIVOT
the column values: they will get into appropriate row and column, based onROW_NUMBER()
and the set name
Since PIVOT
requires aggregation (in case row and column identifiers are not unique), we need some aggregation function, but our query is designed so that there is exactly one row with a given ROW_NUMBER()
and set name, so any aggregation function will do.
We will use MIN
.
Now, let's create a sample table and see how it works:
Read the rest of this entry »
Choosing column order
From Stack Overflow:
I know you can
ALTER
the column order in MySQL withFIRST
andAFTER
, but why would you want to bother?Since good queries explicitly name columns when inserting data, is there really any reason to care what order your columns are in in the table?
Well, there is reason.
MySQL stores some types (out of which VARCHAR
is probably the most important) with variable length.
That means that each variable field stored in the row is prepended with its actual length, like in Pascal string type.
Since the rows have no TOC, it means that the engine should traverse all previous values to find out where the required value begins.
If we need the value of the first column (first non-PRIMARY KEY
column, of course), the engine will do it once per row.
But if we need the value of, say, 20-th column, the engine should find out the lengths of previous 19 columns and add them together.
Let's create a sample table and see how big the impact is:
Read the rest of this entry »
Oracle: matching SYS_GUID’s
Answering questions asked on the site.
Habib asks:
I have a table with a
PRIMARY KEY
defined asConsumerID RAW(16) NOT NULL
.However, when I issue a query like this:
SELECT * FROM Consumers WHERE ConsumerID = '1DFFC45DEDEB42B8B79B060D544C2ACB', it uses a
TABLE ACCESS FULL
and is very slow.How can I improve this query?
Defining a PRIMARY KEY
column as RAW(16)
almost always means using system generated GUID
's returned by SYS_GUID
as PRIMARY KEY
's.
Let's create a sample table and see what's going on:
Read the rest of this entry »
Finding loops in a tree
From Stack Overflow:
I have a view which matches up the needs and the wants, and discards the requests that cannot be fulfilled by any user, that looks like this:
Item Need Have 1 Bob George 2 George Herbie 3 Herbie Bob 4 Larry Wally 5 Wally Larry 6 John George I am now trying to write a query where I can calculate the number of permutations of each user having their requested needs fulfilled.
For instance, in the example above, Bob, George, and Herbie can together fulfill each other's needs, as can Larry and Wally, but John cannot, so the total count would be 2.
This task is actually about finding loops.
Whenever there is a set of users that fulfill each other's needs, there is a loop in the tree.
We can assume that either of the columns (need
or have
) is unique so there there is always a single path in the corresponding direction which ends with at most one loop.
It's not stated in the task, but otherwise it would be an equivalent of Clique problem which is NP
complete and hardly solvable with SQL
.
Now, let's create a sample table:
Read the rest of this entry »
Ranging random values
Comments enabled. I *really* need your comment
From Stack Overflow:
Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?
The ranges should not overlap and ideally have a similar amount of values in them.
This is a good job for SQL Server's analytic function NTILE
.
This function, given a parameter N
, assigns a rank from 1
to N
to each value, just as in quadrile (4), quintile (5), percentile (100) etc.
Lower 1/N
of values get a rank of 1
, from 1/N
to 2/N
get a rank of 2, etc.
The value, though, are integer and random, and we can get gaps between the adjacent ranges.
To avoid this, we need to extend the ranges a little, so each range gets a common bound with both upper and lower neighbor. This is best done by using the middle of the gap as a common range boundary.
Say, NTILE
assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.
Then we will use 205 as a common bound for the first two ranges, and 310 as a commond bound for the next two.
To do this, for each value selected, we will get also the previous one and the next one.
It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.
Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.
Let's create a sample table and see how it works:
Read the rest of this entry »
UNION vs UNION ALL
Comments enabled. I *really* need your comment
In my previous article:
I described efficient ORDER BY
along with LIMIT
for UNION
'ed resultsets.
One thing left unanswered: why UNION DISTINCT
in some cases is faster than UNION ALL
?
UNION DISTINCT
(for which a mere UNION
is an alias) implies selecting DISTINCT
values from the resultsets, while UNION ALL
just selects all values, including duplicates.
On one hand, UNION ALL
does not perform duplicate checking, which should be faster (and almost always is). On the other hand, it returns more rows.
If some complex calculations are performed on these rows, this may kill all performance benefits.
Let's create the sample tables:
Read the rest of this entry »
Ordering UNION’s
Comments enabled. I *really* need your comment
From Stack Overflow:
When I optimize my 2 single queries to run in less than 0.02 seconds and then
UNION
them the resulting query takes over 1 second to run.Also, a
UNION ALL
takes longer than aUNION DISTINCT
.I would assume allowing duplicates would make the query run faster and not slower.
Am I really just better off running the 2 queries separately? I would prefer to use the
UNION
.
This kinds of things happen when the queries employ ORDER BY
along with LIMIT
.
Let's create the sample tables and see what is happening:
Read the rest of this entry »
PostgreSQL: selecting a function
Comments enabled. I *really* need your comment
From Stack Overflow:
Hello,
I want to write a
SELECT
statement as follows:SELECT field_a FROM my_table WHERE field_b IN (my_function(field_c))Is that possible?
Would
my_function
have to return an array?
It is of course possible in PostgreSQL using a set returning function.
Return type of such a function should be declared as SETOF
, so the function will return a rowset of given type.
There are at least two ways to call this function, and in this article I will consider the benefits and drawbacks of each method.
Let's create sample tables:
Read the rest of this entry »
Oracle: returning deleted records
Comments enabled. I *really* need your comment
From Stack Overflow:
I need to return a rowset from an Oracle procedure, and then delete them in that same procedure.
Is there a neat way of doing this without temp tables?
Something like an in-memory cursor maybe?
Oracle has a nice feature of returning cursor variables from a stored procedure or an anonymous block.
As any cursor, the returned cursor is a handle to a stable rowset.
This means that it's not affected by any changes to database that were made after the cursor was open.
We can exploit this feature to achieve the required behavior.
Let's create a sample table:
Read the rest of this entry »