## Archive for September, 2009

## IN vs. JOIN vs. EXISTS: Oracle

Answering questions asked on the site.

**Jason** asks:

I have a co-worker who swears that

Oracle`IN`

queries are slow and refuses to use them. For example:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.

So, in general is an

`IN`

query very expensive? I'm having trouble putting together many non-trivial queries to run on`EXPLAIN PLAN`

.

Does the `IN`

predicate always have inferior efficiency compared to it's counterparts, `EXISTS`

and `JOIN`

?

Let's check.

First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:

`IN`

predicate:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )

`EXISTS`

predicate:SELECT foo FROM bar WHERE EXISTS ( SELECT NULL FROM asdf WHERE asdf.stuff = bar.stuff )

`JOIN`

/`DISTINCT`

:SELECT b.foo FROM ( SELECT DISTINCT stuff FROM asdf ) a JOIN bar b ON b.stuff = a.stuff

All these queries are semantically the same.

Common wisdom advices against using `IN`

predicates because it just doesn't look a right thing to do.

This is because less experienced **SQL** developers tend to translate the **SQL** statements into pseudocode just like they see it, which in this case looks something like this:

foreach ($bar as $bar_record) { foreach ($asdf as $asdf_record) { if ($bar_record->stuff == $asdf_record->stuff) output ($bar_record); break; } }

This just *radiates* inefficiency, since the inner rowset should be iterated for each row returned.

`EXISTS`

looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.

Finally, the `JOIN`

looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an `IN`

? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!

People, come on. **Oracle** developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an `IN`

construct.

Now, let's make two sample tables and see how it's done:

Read the rest of this entry »

## Adjacency list vs. nested sets: MySQL

Continuing the series:

What is better to store hierarchical data:

nested setsmodel oradjacency list(parent-child) model?

For detailed explanations of the terms, see the first article in the series:

This is the last article of the series which covers **MySQL**.

**MySQL** differs from the other systems, since it is the only system of the big four that does not support recursion natively. It has neither recursive **CTE**'s nor `CONNECT BY`

clause, not even rowset returning functions that help to emulate recursion in **PostgreSQL 8.3**.

**MySQL** supports a thing that all other systems either lack or implement inefficiently: session variables. They can be set in a `SELECT`

clause and can be used to keep some kind of a state between the rows as they are processed and returned in a rowset.

This of course is against the whole spirit of **SQL**, since **SQL** implies operations on whole sets and session variables operate on rows and are totally dependent on the order they are returned or processed. But if used properly, this behavior can be exploited to emulate some things that **MySQL** lacks: analytic functions, efficient random row sampling etc.

Hierarchical functions are among the things that need to be emulated in **MySQL** using session variables to keep the function state.

Here's the old article in my blog that shows how to do this:

On the other hand, **MySQL** implements one more thing that is useful for **nested sets** model: `SPATIAL`

indexes.

Read the rest of this entry »

## Adjacency list vs. nested sets: Oracle

Comments enabled. I *really* need your comment

Continuing the series:

What is better to store hierarchical data:

nested setsmodel oradjacency list(parent-child) model?

For detailed explanations of the terms, see the first article in the series:

Today is **Oracle** time.

Adjacent sets require recursion, and **Oracle** has a native way to implement recursion in **SQL** statements. To do this it uses a special clause, `CONNECT BY`

.

Unlike other systems, **Oracle** always supported recursion in **SQL** statements since **version 2** (this is the name the first version of **Oracle** was released under).

This was done because by that time there already were several hierarchical databases on the market which are quite good for storing hierarchical data. If not for this clause, transition from a hierarchical **DBMS** to the relational **DBMS** would be too hard.

**Oracle**'s way to query self-relations is somewhat different from recursive **CTE**'s that other systems use:

- Recursive
**CTE**s can use an arbitrary set in recursive steps. The recursion base and recursion stack are not visible in further steps. The recursive operation is a set operation (usually a`JOIN`

) on the recursion parameter (a set) and the result of the previous operation: `CONNECT BY`

queries can only use the value of a single row in recursive steps.The recursion base and recursion stack are visible, the values outside the recursion stack are not.

The recursive operation is a set operation on each row of the current row and the base rowset.

Unlike recursive

**CTE**,`CONNECT BY`

operates row-wise, that is each row of the result spans its own recursion branch with its own stack:

The main difference it that the `CONNECT BY`

operation cannot produce anything that was not in the original rowset, while a recursive **CTE** can.

However, `CONNECT BY`

has numerous benefits too: it's very fast, allows to track the hierarchy path easily (using a built-in funtion named `SYS_CONNECT_BY_PATH`

) and implicitly sorts the recordset in tree order (with additional clause `ORDER SIBLINGS BY`

to sort the siblings).

This method, of course, allows traversing adjacency lists easily.

To compare efficiency of adjacency lists and that of nested sets, let's create a sample table:

Read the rest of this entry »

## Adjacency list vs. nested sets: SQL Server

Continuing the series:

What is better to store hierarchical data:

nested setsmodel oradjacency list(parent-child) model?

For detailed explanations of the terms, see the first article in the series:

Now, let's see what's better for **SQL Server**.

We will create a single table that holds both adjacency list data and nested sets data, with **8** levels of nesting, **5** children of each parent node and **2,441,405** records:

Read the rest of this entry »

## Adjacency list vs. nested sets: PostgreSQL

This series of articles is inspired by numerous questions asked on the site and on **Stack Overflow**.

What is better to store hierarchical data:

nested setsmodel oradjacency list(parent-child) model?

First, let's explain what all this means.

### Adjacency list

Hierarchical relations (not to be confused with hierarchical data model) are `0-1:0-N`

transitive relations between entities of same domain.

For instance, ancestor-descendant relation is:

- Transitive:
- If
**A**is an ancestor of**B**and**B**is an ancestor of**C**, then**A**is an ancestor of**C**

- If
- Irreflexive:
- If
**A**is an ancestor of**B**, then**B**is never an ancestor of**A**

- If
`0-1:0-N`

**A**can have zero, one or many children.**A**can have zero or one parents.

These relations can be represented by an ordered directed tree.

Tree is a simple directed graph (that with at most one directed edge between two different vertices) and relational model has means to represent simple graphs.

Two vertices are considered related (and therefore their primary keys forming a row in the table) if and only if they are connected with an edge.

This table along with the table defining the vertices identifies a graph completely by defining pairs of vertices connected by the edges. Each record in the table defines a pair of adjacent vertices, that's why this representation is called **adjacency list**.

Adjacency lists can represent any simple directed graphs, not ony hierarchy trees. But due to the fact that this structure is most commonly used to define the parent-child relationships, the terms parent-child model

and adjacency list model

have almost become synonymous. However, they are not: adjacency list model is much wider and parent-child model is one of its implementations.

Now, since we have a tree here which implies `0-1:0-N`

relationship between the vertices, we can define the relation as a *self-relation*: the table defines both the entity and the relationship. Parent is just a one attribute among other attributes with a `FOREIGN KEY`

reference to the table itself.

Since multple items can have no parents (and therefore be the roots of their trees), it's sometimes useful to convert this tree into an arborescence: make a single fake root that considered a parent of all entries that have no actual parent.

This is a nice and elegant model, but until recently it had one drawback: it could not be used with **SQL**.

Read the rest of this entry »

## Building alphabetical index

Comments enabled. I *really* need your comment

Answering questions asked on the site.

**Cora** asks:

I want to build an alphabetical index over the articles in my database.

I need to show

5entries in the index, like this:`A-D`

,`E-F`

etc., so that each entry contains roughly equal number of articles.How do I do it in a single query?

This is in

Oracle

This is a good task to demonstrate **Oracle**'s analytic abilities.

To do this, we will create a sample table:

Read the rest of this entry »

## Finding sets within a time frame

Comments enabled. I *really* need your comment

From **Stack Overflow**:

I've got a set of data that contains users, their awards and the time they are granted.

When a user earns an award, the award and user is logged along with a date.

What I'd like is a query that finds a set of

3achievements that were earned within5minutes of each other by the same user.Any thoughts on how to accomplish this?

I'll go a step further and try to build a general solution to find sets of `N`

values within a given time frame.

We should return all rows that are parts of all 5-minute time frames containing at least `N`

values. These timeframes can overlap.

We should build such a timeframe for each record, make sure that there are at least `N`

values in this timeframe, and if yes, return these values (applying `DISTINCT`

to get rid of the duplicates).

Let's create a sample table and build a query to do this:

## Rounding numbers preserving their sum

Comments enabled. I *really* need your comment

From **Stack Overflow**:

I'm looking for a way to round every value contained in a column without changing the total sum (which is known to be an integer, or very close to an integer because of float number precision).

Rounding every value to the nearest integer won't work (ex :

(1.5, 1.5)will be rounded to either(1, 1)or(2, 2))It's quite easy to do this using several requests (e. g. storing the original sum, rounding, computing the new sum, and updating as many rows as needed to go back to the original sum), but this is not a very elegant solution.

Is there a way to do this using a single

SQLrequest?I'm using

SQL Server 2008, so solutions taking advantage of this specific vendor are welcome.I'm looking for a request minimizing the differences between the old values and the new ones. In other words, a value should never be rounded up if a greater value has been rounded down, and vice-versa

We have two restrictions here:

- The sum of the rounded numbers should be equal to that of the unrounded ones.
- When rounding, no number with a greater fractional part can be rounded down if there is a number with a smaller fractional part that had been rounded up.

We cannot round **(1.4, 1.4, 1.2)** to **(1, 1, 2)**, since **1.4** would be rounded down while **1.2** (with smaller fractional part) is rounded up. However, we can round them to **(1, 2, 1)** or **(1, 2, 1)**, since nothing prevents us from rounding numbers with *equal* fractional part differently.

Given this, we should build a query that would take a set of numbers and round them taking these two restrictions into account.

It is easily seen than to satisfy the second restriction a certain rounding threshold should exist: all numbers with the fractional part greater than this threshold are rounded up, those with the fractional part smaller than then threshold are rounded down, those with the fractional part equal to the threshold can be rounded both ways.

However, calculating this threshold is, first, quite a difficult task and, second, we still don't know what to do with the numbers having the fractional part equal to the threshold.

Instead, we use another approach.

Let's create a sample tables and show this approach:

## NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

Which method is best to select values present in one table but missing in another one?

This:

SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:

SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:

SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )

Finally, it's **MySQL** time.

As always, we will create the sample tables:

## NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle

Which method is best to select values present in one table but missing in another one?

This:

SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:

SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:

SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )

Today, we will see how Oracle copes with these queries.

And to do this, we, of course, should create sample tables: