EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘Oracle’ Category

Happy New Year: drawing clock in SQL

with 9 comments

Some say SQL is not good at graphics.

Well, they have some point. Database engines lack scanner drivers, there is no easy way to do sepia, and magic wand, let's be honest, is just poor.

However, you can make some New Year paintings with SQL.

Let's make a New Year clock showing 12 o'clock in Oracle.

Read the rest of this entry »

Written by Quassnoi

December 31st, 2010 at 11:00 pm

Posted in Oracle

Tagged with , ,

Counting concurrent sessions

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Steve asks:

I am trying to query a log table for a web service application and determine how many concurrent sessions are in progress at each moment a transaction is executed, based on a start date and an elapsed time for each command executed through the web service. (These metrics are logged after the fact, I'm trying to write daily performance reporting for the site).

Here's a simplified view of my base table design:

CREATE TABLE CONNECTIONS
(
USERID VARCHAR2(30),
HANDLE VARCHAR2(40),
PROCESS_DATE DATE,
COMMAND NUMBER(6,0),
COUNT NUMBER(10,0),
ELAPSED_TIME NUMBER(10,0),
NUM_OF_RECS NUMBER(10,0),
COMMAND_TIMESTAMP TIMESTAMP (6)
)

The question is: at there moment each transaction started, how many other transactions were active?

At each given moment, there is some number of active transaction. A transaction is active if the transaction begins before that moment and ends after it. This means that the moment should fall between command_timestamp and command_timestamp + elapsed_time / 86400000.

Database B-Tree indexes are not very good in queries that involve searching for a constant between two columns, so a self-join on the condition described above would be possible but not very efficient.

But these is a more simple solution.

Whenever a transaction starts, it increments the count of the open transactions. Whenever the transaction ends, it decrements it.

So we just can build a table of events: starts and ends of the transactions, ordered chronologically. Each start would be denoted with a +1, and each end with a -1. Then we should just calculate the number of the transactions open so far and subtract the number of the transactions closed.

This can be easily done merely by calculating the partial sum of these +1's and -1's, which is an easy task for Oracle's analytic functions.

Let's create a sample table. I'll put only the relevant columns there and add a stuffing column that would emulate actual payload, to measure performance:
Read the rest of this entry »

Written by Quassnoi

January 25th, 2010 at 11:00 pm

Posted in Oracle

Cumulative values

Comments enabled. I *really* need your comment

From Stack Overflow:

There is a table that stores signal statuses that come from different devices.

  • SS1 and SS2 signals are inserted to table in random times
  • If either of SS1 and SS2 signal statuses is up, then the resulting signal should be up
  • If both SS1 and SS2 signal statuses are down, then resulting signal should be down

I want to prepare a query that shows the result signal status changes according to SS1 and SS2 signals

Each record deals with only one signal type here: either SS1 or SS2. To obtain the signal statuses we should query the cumulative values of previous records.

If a record describes a change in SS2, we should query for the most recent change to SS1 that had been recorded so far to obtain the SS1's current status.

In systems other than Oracle, the previous value of a signal could be easily queried using subselects with TOP / LIMIT clauses. But Oracle does not support correlated queries nested more than one level deep, and limiting a subquery result to a single record (which is required by a subquery) requires it (ORDER BY should be nested). This makes constructing such a subquery in Oracle quite a pain.

However, in Oracle, these things can be queries using analytics functions much more efficiently.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

January 20th, 2010 at 11:00 pm

Posted in Oracle

Oracle: joining timestamps and time intervals

with one comment

From Stack Overflow:

I am making an inner join of two tables.

I have many time intervals represented by intervals and i want to get measure data from measures only within those intervals. The intervals do not overlap.

Here are the table layouts:

intervals
entry_time exit_time
measures
time measure

There are 1295 records in intervals and about a million records in measures. The intervals do not overlap.

The result I want to get is a table with in the first column the measure, then the time the measure has been done, the begin/end time of the considered interval (it would be repeated for row with a time within the considered range)

How can I make this faster?

Straight query

The query to get the results would look like this:

SELECT  measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time ASC

This looks quite simple. However, choosing the execution plan can be somewhat tricky task for this query.

The table layout gives us no hint on how many rows will be returned from the join. If all intervals begin before the time of the first measure and end after the time of the last measure, then every combination of rows will be returned from both tables and the resulting recordset will in fact be a cartesian product of two tables. If the intervals are short and sparse, few or even no rows can be returned from measures.

However, we know that the intervals do not overlap. This means that each measure may belong to at most one interval and the number of records in the final resultset will be no more than the number of records in measures.

Since the condition we join here is a pair of inequalities, only two methods can be used by the engine to perform the join, that is NESTED LOOPS and MERGE JOIN. HASH JOIN which is most efficient on large tables requires an equijoin and cannot be used for this query.

MERGE JOIN sorts both resultsets on the columns it joins on and gradually advances the internal pointer on both sorted row sources so that the values of the columns being joined on in both tables always match. In case of an equijoin, the engine returns only the records holding the current value of the pointer; in case of a join on inequality condition, the engine returns all records greater (or less) than the current value from the corresponding table.

MERGE JOIN, however, can only satisfy a single inequality condition while we here have two of them on two different columns. The engine should split this condition into the pair of inequalities: the first one will be used by the join; the second one will be just filtered. The query essentially turns into this one:

SELECT  measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time >= intervals.entry_time
WHERE   measures.time  <= intervals.exit_time
ORDER BY
        time ASC


Here, the MERGE JOIN will be using the predicate in the ON clause and the filtering will use the predicate in the WHERE clause. The predicates are symmetrical and can easily be swapped.

However, the join will have to return all records to the filtering code. And with an inequality condition like the one we see above, there will be lots of records. If we take a normal situation: the interval bounds are more or less in accordance with the dates of the first and last measure and the intervals are distributed evenly, then the MERGE JOIN will return (1265 × 1,000,000) / 2 ≈ 600,000,000 records, each to be filtered on the next step.

From performance's point of view, this is hardly different from a cartesian join: in fact, it needs to process only as few as a half of the rows. These rows, however, need to be sorted beforehand (or taken from the index which is slow to traverse in its sort order), so this can actually even be slower than a cartesian join which does not require the rows to be in any certain order and can use an INDEX FAST FULL SCAN or a TABLE SCAN.

This means the only efficient way to run this query is using NESTED LOOPS. To benefit from this method, we should make an index on measures.time and convince Oracle to use it.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 28th, 2009 at 11:00 pm

Posted in Oracle

Oracle: Selecting records holding group-wise maximum

with one comment

It was quite a time since I posted my last article, and I have a bunch of unanswered questions that people asked me.

Sorry for not posting and answering for so long, had a few urgent things to do, and hope these breaks won't be so long in the future :)

Now, continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

Finally, Oracle.

We will create the very same table as we did for the previous systems:
Read the rest of this entry »

Written by Quassnoi

December 16th, 2009 at 11:00 pm

Posted in Oracle

Oracle: nested SUM

with one comment

From Stack Overflow:

Suppose I have this table

A B C D

Datatypes are not important.

I want to do this:

SELECT  a AS a1, b AS b1, c AS c1,
(
SELECT  SUM(d)
FROM    source
WHERE   a = a1
AND b = b1
) AS total
FROM    source
GROUP BY
a, b, c

, but I can't find a way (SQL Developer keeps complaining with FROM clause not found.)

Is there a way? Is it possible?

This is of course possible if we just alias the query and prepend the alias to the field:

SELECT  a, b, c,
(
SELECT  SUM(d)
FROM    source si
WHERE   si.a = so.a
AND si.b = so.b
CONNECT BY
16 >= level
)
FROM    source so
GROUP BY
a, b, c

This works well on this sample set of data:

WITH    source AS
(
SELECT  FLOOR(MOD(level - 1, 8) / 4) + 1 AS a,
FLOOR(MOD(level - 1, 4) / 2) + 1 AS b,
MOD(level - 1, 2) + 1 AS c,
level AS d
FROM    dual
CONNECT BY
16 >= level
)
SELECT  a, b, c,
(
SELECT  SUM(d)
FROM    source si
WHERE   si.a = so.a
AND si.b = so.b
)
FROM    source so
GROUP BY
a, b, c

 
View query details

, but it needs to reevaluate the subquery for each group.

In Oracle, there is a better way to do this query: nesting an aggregate SUM inside an analytical SUM.
Read the rest of this entry »

Written by Quassnoi

October 8th, 2009 at 11:00 pm

Posted in Oracle

IN vs. JOIN vs. EXISTS: Oracle

with 11 comments

Answering questions asked on the site.

Jason asks:

I have a co-worker who swears that Oracle IN queries are slow and refuses to use them. For example:

SELECT  foo
FROM    bar
WHERE   bar.stuff IN
(
SELECT  stuff
FROM    asdf
)

Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.

So, in general is an IN query very expensive? I'm having trouble putting together many non-trivial queries to run on EXPLAIN PLAN.

Does the IN predicate always have inferior efficiency compared to it's counterparts, EXISTS and JOIN?

Let's check.

First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:

  • IN predicate:
    SELECT  foo
    FROM    bar
    WHERE   bar.stuff IN
    (
    SELECT  stuff
    FROM    asdf
    )
    
  • EXISTS predicate:
    SELECT  foo
    FROM    bar
    WHERE   EXISTS
    (
    SELECT  NULL
    FROM    asdf
    WHERE   asdf.stuff = bar.stuff
    )
    
  • JOIN / DISTINCT:
    SELECT  b.foo
    FROM    (
    SELECT  DISTINCT stuff
    FROM    asdf
    ) a
    JOIN    bar b
    ON      b.stuff = a.stuff
    

All these queries are semantically the same.

Common wisdom advices against using IN predicates because it just doesn't look a right thing to do.

This is because less experienced SQL developers tend to translate the SQL statements into pseudocode just like they see it, which in this case looks something like this:

foreach ($bar as $bar_record) {
foreach ($asdf as $asdf_record) {
if ($bar_record->stuff == $asdf_record->stuff)
output ($bar_record);
break;
}
}

This just radiates inefficiency, since the inner rowset should be iterated for each row returned.

EXISTS looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.

Finally, the JOIN looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an IN? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!

People, come on. Oracle developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an IN construct.

Now, let's make two sample tables and see how it's done:
Read the rest of this entry »

Written by Quassnoi

September 30th, 2009 at 11:00 pm

Posted in Oracle

Adjacency list vs. nested sets: Oracle

Comments enabled. I *really* need your comment

Continuing the series:

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

For detailed explanations of the terms, see the first article in the series:

Today is Oracle time.

Adjacent sets require recursion, and Oracle has a native way to implement recursion in SQL statements. To do this it uses a special clause, CONNECT BY.

Unlike other systems, Oracle always supported recursion in SQL statements since version 2 (this is the name the first version of Oracle was released under).

This was done because by that time there already were several hierarchical databases on the market which are quite good for storing hierarchical data. If not for this clause, transition from a hierarchical DBMS to the relational DBMS would be too hard.

Oracle's way to query self-relations is somewhat different from recursive CTE's that other systems use:

  • Recursive CTEs can use an arbitrary set in recursive steps. The recursion base and recursion stack are not visible in further steps. The recursive operation is a set operation (usually a JOIN) on the recursion parameter (a set) and the result of the previous operation:

    Recursive CTE

  • CONNECT BY queries can only use the value of a single row in recursive steps.

    The recursion base and recursion stack are visible, the values outside the recursion stack are not.

    The recursive operation is a set operation on each row of the current row and the base rowset.

    Unlike recursive CTE, CONNECT BY operates row-wise, that is each row of the result spans its own recursion branch with its own stack:

    CONNECT BY

The main difference it that the CONNECT BY operation cannot produce anything that was not in the original rowset, while a recursive CTE can.

However, CONNECT BY has numerous benefits too: it's very fast, allows to track the hierarchy path easily (using a built-in funtion named SYS_CONNECT_BY_PATH) and implicitly sorts the recordset in tree order (with additional clause ORDER SIBLINGS BY to sort the siblings).

This method, of course, allows traversing adjacency lists easily.

To compare efficiency of adjacency lists and that of nested sets, let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

September 28th, 2009 at 11:00 pm

Posted in Oracle

Building alphabetical index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Cora asks:

I want to build an alphabetical index over the articles in my database.

I need to show 5 entries in the index, like this: A-D, E-F etc., so that each entry contains roughly equal number of articles.

How do I do it in a single query?

This is in Oracle

This is a good task to demonstrate Oracle's analytic abilities.

To do this, we will create a sample table:
Read the rest of this entry »

Written by Quassnoi

September 23rd, 2009 at 11:00 pm

Posted in Oracle

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle

with 7 comments

Which method is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
(
SELECT  value
FROM    t_right r
)

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_right r
WHERE   r.value = l.value
)

Today, we will see how Oracle copes with these queries.

And to do this, we, of course, should create sample tables:

Read the rest of this entry »

Written by Quassnoi

September 17th, 2009 at 11:00 pm

Posted in Oracle