# EXPLAIN EXTENDED

How to create fast database queries

## Finding loops in a tree

From Stack Overflow:

I have a view which matches up the needs and the wants, and discards the requests that cannot be fulfilled by any user, that looks like this:

Item Need Have
1 Bob George
2 George Herbie
3 Herbie Bob
4 Larry Wally
5 Wally Larry
6 John George

I am now trying to write a query where I can calculate the number of permutations of each user having their requested needs fulfilled.

For instance, in the example above, Bob, George, and Herbie can together fulfill each other's needs, as can Larry and Wally, but John cannot, so the total count would be 2.

Whenever there is a set of users that fulfill each other's needs, there is a loop in the tree.

We can assume that either of the columns (`need` or `have`) is unique so there there is always a single path in the corresponding direction which ends with at most one loop.

It's not stated in the task, but otherwise it would be an equivalent of Clique problem which is `NP` complete and hardly solvable with `SQL`.

Now, let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

May 18th, 2009 at 11:00 pm

Posted in Oracle

## Ranging random values

From Stack Overflow:

Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?

The ranges should not overlap and ideally have a similar amount of values in them.

This is a good job for SQL Server's analytic function `NTILE`.

This function, given a parameter `N`, assigns a rank from `1` to `N` to each value, just as in quadrile (4), quintile (5), percentile (100) etc.

Lower `1/N` of values get a rank of `1`, from `1/N` to `2/N` get a rank of 2, etc.

The value, though, are integer and random, and we can get gaps between the adjacent ranges.

To avoid this, we need to extend the ranges a little, so each range gets a common bound with both upper and lower neighbor. This is best done by using the middle of the gap as a common range boundary.

Say, `NTILE` assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.

Then we will use 205 as a common bound for the first two ranges, and 310 as a commond bound for the next two.

To do this, for each value selected, we will get also the previous one and the next one.

It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.

Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

May 17th, 2009 at 11:00 pm

Posted in SQL Server

## UNION vs UNION ALL

In my previous article:

I described efficient `ORDER BY` along with `LIMIT` for `UNION`'ed resultsets.

One thing left unanswered: why `UNION DISTINCT` in some cases is faster than `UNION ALL`?

`UNION DISTINCT` (for which a mere `UNION` is an alias) implies selecting `DISTINCT` values from the resultsets, while `UNION ALL` just selects all values, including duplicates.

On one hand, `UNION ALL` does not perform duplicate checking, which should be faster (and almost always is). On the other hand, it returns more rows.

If some complex calculations are performed on these rows, this may kill all performance benefits.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

May 16th, 2009 at 11:00 pm

Posted in MySQL

## Ordering UNION’s

From Stack Overflow:

When I optimize my 2 single queries to run in less than 0.02 seconds and then `UNION` them the resulting query takes over 1 second to run.

Also, a `UNION ALL` takes longer than a `UNION DISTINCT`.

I would assume allowing duplicates would make the query run faster and not slower.

Am I really just better off running the 2 queries separately? I would prefer to use the `UNION`.

This kinds of things happen when the queries employ `ORDER BY` along with `LIMIT`.

Let's create the sample tables and see what is happening:
Read the rest of this entry »

Written by Quassnoi

May 15th, 2009 at 11:00 pm

Posted in MySQL

## PostgreSQL: selecting a function

From Stack Overflow:

Hello,

I want to write a `SELECT` statement as follows:

```SELECT  field_a
FROM    my_table
WHERE   field_b IN (my_function(field_c))
```

Is that possible?

Would `my_function` have to return an array?

It is of course possible in PostgreSQL using a set returning function.

Return type of such a function should be declared as `SETOF`, so the function will return a rowset of given type.

There are at least two ways to call this function, and in this article I will consider the benefits and drawbacks of each method.

Let's create sample tables:
Read the rest of this entry »

Written by Quassnoi

May 14th, 2009 at 11:00 pm

Posted in PostgreSQL

## Oracle: returning deleted records

From Stack Overflow:

I need to return a rowset from an Oracle procedure, and then delete them in that same procedure.

Is there a neat way of doing this without temp tables?

Something like an in-memory cursor maybe?

Oracle has a nice feature of returning cursor variables from a stored procedure or an anonymous block.

As any cursor, the returned cursor is a handle to a stable rowset.

This means that it's not affected by any changes to database that were made after the cursor was open.

We can exploit this feature to achieve the required behavior.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

May 13th, 2009 at 11:00 pm

Posted in Oracle

From Stack Overflow:

A relation table is the common solution to representing a many-to-many (`m:n`) relationship.

In the simplest form, it combines foreign keys referencing the two relating tables to a new composite primary key:

```CREATE TABLE t_a (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_b (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_ab (
a INT NOT NULL,
b INT NOT NULL
)
```

How should it be indexed to provide optimal performance in every JOIN situation?

I'll consider SQL Server as the target platform this time.

We need all `WHERE`, `ORDER BY` and `GROUP BY` conditions to be covered by the indexes.

On a first thought, we need to have indexes for all columns in all orders:
Read the rest of this entry »

Written by Quassnoi

May 12th, 2009 at 11:00 pm

Posted in SQL Server

## PostgreSQL: emulating ROW_NUMBER

PostgreSQL 8.4 introduces window functions.

Window function `ROW_NUMBER()` implements the functionality in question more efficiently.

In one of the previous articles:

, I described emulating Oracle's pseudocolumn `ROWNUM` in PostgreSQL.

Now, we'll extend this query to emulate `ROW_NUMBER`.

A quick reminder: `ROW_NUMBER` is an analytical function in `ANSI SQL 2003` supported by Oracle and MS SQL Server.

It enumerates each row in a resultset, but, unlike `ROWNUM`, may take two additional parameters: `PARTITION BY` and `ORDER BY`.

`PARTITION BY` splits a rowset into several partitions, each of them being numbered with its own sequence starting from 1.

`ORDER BY` defines the order the rows are numbered within each partition. This order may differ from the order the rows are returned in.

This function helps building queries which allow to select `N` rows for each partition.

Let's create a sample table and see how we do it in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

May 11th, 2009 at 11:00 pm

Posted in PostgreSQL

## Creating indexes

I have a single rather large table with 18 columns.

Data in a table are continuous time stamped records.

On one hand, the application is appending new records to this table; on the other hand, users can do various queries.

My question is: whether I should use composite indexes or few single column indexes, or a mix of the two?

There could be all sorts of combinations in `WHERE` clause, I don't want to limit users in their choices.

Is there any common sense strategy to decide on indexes in my case?

Unfortunately you forgot to mention what RDBMS you are using, so I'll assume MySQL. The principles are all the same.

Generally speaking, an index on `(column1, column2, column3)` can be used for any kind of predicate that can be represented in the following form:

`ROW(@value1_start, @value2_start, @value3_start) < ROW(column1, column2, column3) < ROW(@value1_end, @value2_end, @value3_end)`

In other words, if all the records are sorted by `column1, column2, column3`, the condition should select a contiguous block of such a sorted set of records.

All `@value`'s here can match and can be `+Infinity` or `-Infinity`.

Let's look to some real world predicates and decide whether they are sargable (i. e. an index search can be applied against them). We assume that all columns are `INTEGER`:
Read the rest of this entry »

Written by Quassnoi

May 9th, 2009 at 11:00 pm

Posted in MySQL

## Selecting first letters

with one comment

From Stack Overflow:

I would like to produce a character list of all of the first letters of column in my database.

Is there a way to do this in MySQL?

Let's create a sample table of 1,000,000 records and fill it with random data:
Read the rest of this entry »

Written by Quassnoi

May 8th, 2009 at 11:30 pm

Posted in MySQL