EXPLAIN EXTENDED

How to create fast database queries

Author Archive

Indexing a link table

Comments enabled. I *really* need your comment

From Stack Overflow:

A relation table is the common solution to representing a many-to-many (m:n) relationship.

In the simplest form, it combines foreign keys referencing the two relating tables to a new composite primary key:

CREATE TABLE t_a (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_b (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_ab (
a INT NOT NULL,
b INT NOT NULL
)

How should it be indexed to provide optimal performance in every JOIN situation?

I'll consider SQL Server as the target platform this time.

We need all WHERE, ORDER BY and GROUP BY conditions to be covered by the indexes.

On a first thought, we need to have indexes for all columns in all orders:
Read the rest of this entry »

Written by Quassnoi

May 12th, 2009 at 11:00 pm

Posted in SQL Server

PostgreSQL: emulating ROW_NUMBER

Comments enabled. I *really* need your comment

Note: this article concerns PostgreSQL 8.3 and below.

PostgreSQL 8.4 introduces window functions.

Window function ROW_NUMBER() implements the functionality in question more efficiently.

In one of the previous articles:

, I described emulating Oracle's pseudocolumn ROWNUM in PostgreSQL.

Now, we'll extend this query to emulate ROW_NUMBER.

A quick reminder: ROW_NUMBER is an analytical function in ANSI SQL 2003 supported by Oracle and MS SQL Server.

It enumerates each row in a resultset, but, unlike ROWNUM, may take two additional parameters: PARTITION BY and ORDER BY.

PARTITION BY splits a rowset into several partitions, each of them being numbered with its own sequence starting from 1.

ORDER BY defines the order the rows are numbered within each partition. This order may differ from the order the rows are returned in.

This function helps building queries which allow to select N rows for each partition.

Let's create a sample table and see how we do it in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

May 11th, 2009 at 11:00 pm

Posted in PostgreSQL

Creating indexes

Comments enabled. I *really* need your comment

Answering the questions asked on the site.

Dima asks:

I have a single rather large table with 18 columns.

Data in a table are continuous time stamped records.

On one hand, the application is appending new records to this table; on the other hand, users can do various queries.

My question is: whether I should use composite indexes or few single column indexes, or a mix of the two?

There could be all sorts of combinations in WHERE clause, I don't want to limit users in their choices.

Is there any common sense strategy to decide on indexes in my case?

Unfortunately you forgot to mention what RDBMS you are using, so I'll assume MySQL. The principles are all the same.

Generally speaking, an index on (column1, column2, column3) can be used for any kind of predicate that can be represented in the following form:

ROW(@value1_start, @value2_start, @value3_start) < ROW(column1, column2, column3) < ROW(@value1_end, @value2_end, @value3_end)

In other words, if all the records are sorted by column1, column2, column3, the condition should select a contiguous block of such a sorted set of records.

All @value's here can match and can be +Infinity or -Infinity.

Let's look to some real world predicates and decide whether they are sargable (i. e. an index search can be applied against them). We assume that all columns are INTEGER:
Read the rest of this entry »

Written by Quassnoi

May 9th, 2009 at 11:00 pm

Posted in MySQL

Selecting first letters

with one comment

From Stack Overflow:

I would like to produce a character list of all of the first letters of column in my database.

Is there a way to do this in MySQL?

Let's create a sample table of 1,000,000 records and fill it with random data:
Read the rest of this entry »

Written by Quassnoi

May 8th, 2009 at 11:30 pm

Posted in MySQL

Checking event dates

Comments enabled. I *really* need your comment

From Stack Overflow:

Suppose the following table structure in Oracle:

CREATE TABLE event (
id INTEGER,
start_date DATE,
end_date DATE
)

Is there a way to query all of the events that fall on a particular day of the week?

For example, I would like to find a query that would find every event that falls on a Monday.

Figuring out if the start_date or end_date falls on a Monday is easy, but I'm not sure how to find it out for the dates between.

This is one of the range predicates which are very unfriendly to plain B-Tree indexes.

But even if there would be a range friendly index (like R-Tree), that would hardly be an improvement. Monday's make up 14.3% of all days, that means that an index if there were any, would have very low selectivity even for one-day intervals.

And if the majority of intervals last for more than one day, the selectivity of the condition yet decreases: 86% of 6-day intervals have a Monday inside.

Given the drawbacks of index scanning and joining on ROWID, we can say that a FULL TABLE SCAN will be a nice access path for this query, and we just need to represent it as an SQL condition (without bothering for its sargability)

We could check that a Monday is between end_date's day-of-week number and the range length's offset from this number:

SELECT  *
FROM    "20090507_dates".event
WHERE   6 BETWEEN MOD(start_date - TO_DATE(1, 'J'), 7) AND MOD(start_date - TO_DATE(1, 'J'), 7) + end_date - start_date

This query converts each ranges into a pair of zero-based, Tuesday-based day of week offsets, and returns all records which have day 6 (a Monday) inside the range.

Note that we don't use Oracle's TO_DATE('D') function here: starting day of week depends on NLS_TERRITORY which only leads to confusion.

Now, this query works but looks quite ugly. And if we will check for more complex conditions, it will become even uglier.

What if we need to find all ranges that contain a Friday, 13th? Or a second week's Thursday? The conditions will become unreadable and unmaintainable.

Can we do it in some more elegant way?

What if we just iterate over the days of the range and check each day for the condition? This should be much more simple than inventing the boundaries.

Let's create a sample table and try it:
Read the rest of this entry »

Written by Quassnoi

May 7th, 2009 at 11:00 pm

Posted in Oracle

Oracle: ROW_NUMBER vs ROWNUM

with 6 comments

From Stack Overflow:

I have an SQL query that looks something like this:

SELECT  *
FROM    (
SELECT  t.*, row_number() OVER (ORDER BY ID) rn
FROM    mytable t
)
WHERE   rn BETWEEN :start and :end

Essentially, it's the ORDER BY part that's slowing things down. If I were to remove it, the EXPLAIN cost goes down by an order of magnitude (over 1,000 times).

I've tried this:

SELECT  t.*, row_number() OVER (ORDER BY ID) rn
FROM    mytable t
WHERE   rownum BETWEEN :start and :end

, but this doesn't give correct results.

Is there any easy way to speed this up? Or will I have to spend some more time with the EXPLAIN tool?

First, just a quick reminder on how ROWNUM works. From Oracle's documentation:

Conditions testing for ROWNUM values greater than a positive integer are always false. For example, this query returns no rows:

SELECT  *
FROM    employees
WHERE   ROWNUM &gt; 1

The first row fetched is assigned a ROWNUM of 1 and makes the condition false. The second row to be fetched is now the first row and is also assigned a ROWNUM of 1 and makes the condition false. All rows subsequently fail to satisfy the condition, so no rows are returned.

That's why the second query should look like this:

SELECT  *
FROM    (
SELECT  t.*, ROWNUM AS rn
FROM    mytable t
ORDER BY
paginator, id
)
WHERE   rn BETWEEN :start and :end

Now, let's see the performance. To do this, we'll create a sample table:
Read the rest of this entry »

Written by Quassnoi

May 6th, 2009 at 11:00 pm

Posted in Oracle

PostgreSQL: row numbers

with 2 comments

Note: this article concerns PostgreSQL 8.3 and below.

PostgreSQL 8.4 introduces window functions.

Window function ROW_NUMBER() implements the functionality in question more efficiently.

ROWNUM is a very useful pseudocolumn in Oracle that returns the position of each row in a final dataset.

Upcoming PostgreSQL 8.4 will have this pseudocolumn, but as for now will we need a hack to access it.

The main idea is simple:

  1. Wrap the query results into an array
  2. Join this array with a generate_series() so that numbers from 1 to array_upper() are returned
  3. For each row returned, return this number (as ROWNUM) along the corresponding array member (which is the row from the original query)

Let's create a table with multiple columns of different datatypes, write a complex query and try to assign the ROWNUM to the query results:
Read the rest of this entry »

Written by Quassnoi

May 5th, 2009 at 11:00 pm

Posted in PostgreSQL

Assigning people

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Greg asks:

I have two tables, tasks and resources.

Tasks contains cost of each task: first requires 5 people to complete, second requires 3 people etc.

Resources is just a list of people.

How do I assign people to tasks?

This is in MySQL.

Thanks for a nice question, Greg.

Unfortunately, you haven't sent your table structure, so I will have to make it up.

Let's create the tables:
Read the rest of this entry »

Written by Quassnoi

May 4th, 2009 at 11:00 pm

Posted in MySQL

PostgreSQL: optimizing DISTINCT

with 8 comments

In PostgreSQL (as of 8.3, at least), performance of DISTINCT clause in SELECT list is quite poor.

Probably because DISTINCT code in PostgreSQL is very, very old, it always acts in same dumb way: sorts the resultset and filters out the duplicate records.

GROUP BY that can be used for the same purpose is more smart, as it employs more efficient HashAggregate, but its performance is still poor for large dataset.

All major RDBMS's, including MySQL, are able to jump over index keys to select DISTINCT values from an indexed table. This is extremely fast if there are lots of records in a table but not so many DISTINCT values.

This behavior can be emulated in PostgreSQL too.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

May 3rd, 2009 at 11:00 pm

Posted in PostgreSQL

GROUP_CONCAT in PostgreSQL without aggregate functions

with one comment

In one of the previous articles:

Aggregate concatenation

, I described an aggregate function to concatenate strings in PostgreSQL, similar to GROUP_CONCAT in MySQL.

It's very useful if you have a complex GROUP BY query with multiple conditions.

But for some simple queries it's possible to emulate GROUP_CONCAT with pure SQL, avoiding custom functions at all.

Let's create a table to demonstrate our task:
Read the rest of this entry »

Written by Quassnoi

May 2nd, 2009 at 11:00 pm

Posted in PostgreSQL