EXPLAIN EXTENDED

How to create fast database queries

Author Archive

Schema junk

with one comment

Many tasks in SQL world can be efficiently performed using stored procedures.

All database management systems of the Big Four (that is Oracle, SQL Server, MySQL and PostgreSQL) have a concept of a stored procedure: a little server-managed program written in a procedural language that allows running SQL statements and returning results.

They are widely recommended for use, since they allow easy separation of business rules from data storage rules, fine-grained security, integrity control and all other things software analysts mumble for their living.

People widely accept this theory and if you ask anyone should I use the stored procedures?, the answer you will get will be a unanimous yes.

The problem is that people rarely use them in fact in real applications. Everybody knows they should but nobody really does.

More than that: despite the fact that stored procedures in fact can be more efficient than plain queries (even for one-time tasks), people often prefer the latter. Writing a procedure to obtain a resultset seems to be an overkill from the point of view of most developers, they rather prefer doing it with a plain SQL query.

Why so?

Let me tell you another boring story. Don't worry, it won't take long.
Read the rest of this entry »

Written by Quassnoi

July 5th, 2009 at 11:00 pm

Posted in Miscellaneous

Searching for NULL columns

Comments enabled. I *really* need your comment

From Stack Overflow:

I want to make sure I'm not inserting a duplicate row into my table (i. e. only PRIMARY KEY different).

All my fields allow NULL's as I've decided NULL to mean all values.

Because of NULL's, the following statement in my stored procedure can't work:

IF EXISTS
        (
        SELECT  *
        FROM    MY_TABLE
        WHERE   MY_FIELD1 = @IN_MY_FIELD1
                AND MY_FIELD2 = @IN_MY_FIELD2
                AND MY_FIELD3 = @IN_MY_FIELD3 
                AND MY_FIELD4 = @IN_MY_FIELD4
                AND MY_FIELD5 = @IN_MY_FIELD5
                AND MY_FIELD6 = @IN_MY_FIELD6
        )
BEGIN
        goto on_duplicate
END

, since NULL = NULL is not true.

How can I check for the duplicates without having an IF ISNULL statement for every column?

First of all, can't help being a smartass and tell that a UNIQUE index should be defined over these columns.

SQL Server does index NULL values and they are first class values as long as uniqueness is concerned, so creating this index will manage the duplicates automatically.

Now, back to business. Of course there are cases when creating such an index is not acceptable.

In these cases a very nice but little known operator INTERSECT comes handy.

This operator takes two resultsets with the same layout and finds rows common to them (eliminatnig duplicates), as SQL Server's understands common: each value in each column should be either equal to one in the other row or they both should be NULL's.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 4th, 2009 at 11:00 pm

Posted in SQL Server

Overlapping ranges: PostgreSQL

Comments enabled. I *really* need your comment

Continuing the theme on overlapping ranges:

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Finally, PostgreSQL.

A quick reminder: we have two options to write a condition for such a query.

A more simple one:

start_date < @week_end AND end_date > @week_start

is, uh, more simple, but less efficient in SQL Server and MySQL.

A more complex one:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

is more index friendly and is more performant in SQL Server, Oracle and MySQL (with a slight schema change to enable SPATIAL index support)

Now, let's create a sample table in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

July 3rd, 2009 at 11:00 pm

Posted in PostgreSQL

Overlapping ranges: Oracle

Comments enabled. I *really* need your comment

Continuing the theme on overlapping ranges:

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Today, let's see it in Oracle.

A quick reminder: we have two options to write a condition for such a query.

A more simple one:

start_date < @week_end AND end_date > @week_start

is, uh, more simple, but less efficient in SQL Server and MySQL.

A more complex one:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

is more index friendly.

Let's create a sample table and see how Oracle copes with these conditions:
Read the rest of this entry »

Written by Quassnoi

July 2nd, 2009 at 11:00 pm

Posted in Oracle

Overlapping ranges: MySQL

with one comment

Continuing yesterday's post.

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Now, let's see how do it in MySQL.

In the previous post we saw that SQL Server is not very good in optimizing this construct:

start_date < @week_end AND end_date > @week_start

and using it leads to a full table scan. A more complex condition is more efficient in SQL Server:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

, since two different indexes can be applied to both parts of the condition.

However, the second condition is intrinsically not very efficient which is a problem for almost any datetime range analysis.

A plain B-Tree index is efficient for queries like this:

constant1 < column < constant1

, but not for the queries like this:

column1 < constant < column2

The latter condition is in fact rewritten as following by the optimizer:

column1 < constant AND column2 > constant

and the optimizer is forced to choose between two non-optimal options: either stick to one index and do the filtering (using table lookups) or perform two index lookups and merge the results.

If 1,000,000 rows satisfy the first condition and 1,000,000 satisfy the second one, the engine needs to inspect either 1,000,000 rows with table lookups or 2,000,000 rows using indexes, only to find out that the intersection contains but 10 rows.

Fortunately, MySQL supports R-Tree indexes (called SPATIAL in MySQL because they are primarily used to query spatial data).

Indexes of this kind are particularly good for the queries similar to the one above, and with a little effort they can be used to query time data too.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 1st, 2009 at 11:00 pm

Posted in MySQL

Overlapping ranges: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

This query needs to find all events that overlap the week.

The condition for overlapping ranges is well known and quite simple:

start_date < @week_end AND end_date > @week_start

However, in SQL more simple doesn't always mean more efficient, and the following condition, despite being more complex, can yield better performance:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

Let's create a sample table and see how so:
Read the rest of this entry »

Written by Quassnoi

June 30th, 2009 at 11:00 pm

Posted in SQL Server

The Island of Misfit Cursors

with 3 comments

When I was in the elementary school, I once visited my friend.

His mother greeted us, invited us inside and told us to wash our hands and go to the kitchen. Then she turned on a giant electrical samovar to make us some tea.

When it boiled, she switched it off and the friend took the teapot to fill it with hot water. Wait, said his mother. Didn't you forget something? Sorry, mum, said he, took a long steel spoon and stirred the water inside the samovar with it.

What's that you are doing? asked I. His mom replied: It's good for the samovar. It will live longer if you stir water with a metal spoon after it boiled, the manual says so.

Never knew about it, but OK, good to know.

About a year later my parents and I went to visit my grandmother. We kissed hello and she put exactly same electrical samovar on to boil. When it boiled, I turned it off and took a spoon to stir water. Wait! the granny shouted. What are you going to do, kill yourself? You don't put metal things inside the electric appliances!

I was quite confused. My friend's mom adviced to do this, and the friend did, and everything is OK. On the other had, granny's statement also made sense: metal is a bad thing to have between you and electricity.

I was 10 or so that time so I forgot that case quickly.

But when the grandma died, she left some of her belongings to me. I went back to her old house and found the box with the samovar. I recalled the spoon and got curious. The box had a manual inside. I opened the page with the warnings and among other things found the following: With this new model, it is safe to stir water using metal objects.

As you may know, people never read manuals, and when they do, they get them wrong.
Read the rest of this entry »

Written by Quassnoi

June 28th, 2009 at 11:00 pm

Posted in Miscellaneous

Efficient INSERT ON DUPLICATE KEY UPDATE

Comments enabled. I *really* need your comment

From Stack Overflow:

I'm implementing a memory cache for a table in MySQL that looks like this:

item1 item2 cnt

The original table includes millions of pairs like this and updates rapidly.

To make it all more efficient I want to write new pairs to an identical memory table and update the real table on disk periodically by cron.

The cron should do the following: for each pair, if there is similar pair in the non-memory table, increase the count by the count from the memory table.

If no such pair exists, create it with count from the memory table.

How can I make the flush (from memory table to real table) most efficient?

This is of course best achieved with INSERT ON DUPLICATE KEY UPDATE.

It's a MySQL flavor of INSERT operator that automatically handles key violations and updates the row that causes a violation instead of inserting.

If we provide a SELECT statement as a row source for INSERT ON DUPLICATE KEY UPDATE, multiple input rows may cause the key violation on the target row. In this case, the target row will be updated many times.

To achieve the desired result, we have two options:

  1. Use the memory table as a row source for INSERT ON DUPLICATE KEY UPDATE as is. The target rows will be updated multiple times.
  2. Aggregate the counters in the memory table and use the aggregated values as the row source. The target rows will be updated at most once.

The first solution will update the rows many times (which can be slow), while the second one involves some overhead on GROUP BY.

Let's create the sample tables and see which one is more efficient:
Read the rest of this entry »

Written by Quassnoi

June 27th, 2009 at 11:00 pm

Posted in MySQL

Included columns

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Peter asks:

I've been looking around for explanation of how INCLUDED columns in SQL Server indexes work and found myself totally confused.

Could you please explain what they are for?

Thanks!

In a couple of words, an INCLUDED column is a column which is included into the index payload but not the key, i. e. the index leaves are not sorted on this column.

This can save some DML overhead on columns which are filtered on non-ranged conditions.

If the index leaves are less in size than the table rows, it can be more efficient to get the column value from the index rather than from the table to filter on it.

This can help to get rid of expensive Key Lookups or Clustered Index Scans.

In one of applications I developed, I had a large table that kept the restaurant orders.

The table was very large both in row count and the row size: a little less than a hundred columns.

Among other fields there was a DATETIME column (the order date) and an INT column (the client id)

One of the most used reports required that I calculated the average price of the orders for a certain set of clients within a certain period, say, for June.

For the sake of brevity, I'll replace all these numerous columns with a single NVARCHAR(4000) column.

This will make the table definition more brief and comprehensible, leaving the table rows as large.

Here's what the table looked like:
Read the rest of this entry »

Written by Quassnoi

June 26th, 2009 at 11:00 pm

Posted in SQL Server

Building permutations: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I am doing a testing scenario.

There are 6 people in each site:

Site People
Site 1 A, B, C, D, E, F
Site 2 G, H, I, J, K, L
Site 3 M, N, O, P, Q, R
Site 4 S, T, U, V, W, X

I want to write a query that can suggests me datewise the people who can test the site, two at a time.

Here are the rules:

  1. If a person has tested his site on Day 1, then his turn should come on Day 4 and not before that.
    This, however, is applicable for current week. So if A and D test a site on 22nd, B and E test it on 23rd and C and F test on 24th, then for this week, A and D can test the site only on 25th. Sunday is a holiday.
  2. Every week, the pair should change. Rule 1 is still applicable with the new pair.
  3. A person belonging to a particular site cannot test other sites.

To build these rules, we need to do the following:

  1. For each site, split 6 people into 3 pairs.
  2. Assign each pair to the appropriate day of week (pair 1 for Monday and Thursday, pair 2 for Tuesday and Friday, pair 3 for Wednesday and Saturday)
  3. For each new week, split the 6 people new way so that no pairs persist.

Step 3 is most interesting here, since it requires building permutations.

To build the pairs, we need to arrange 6 people and take the pairs from the arrangements.

There are 6! = 720 distinct ways to arrange 6 people.

But for our task the following pairs:

  • (1, 2) and (2, 1)
  • (3, 4) and (4, 3)
  • (5, 6) and (6, 5)

are considered equal. That's why we can use only 720 / 8 = 90 arrangements which yield non-repeating pairs.

SQL Server does not provide a native way to build all permutations of a set. We will have to emulate it using recursive CTE's.

Let's create a table with required data:
Read the rest of this entry »

Written by Quassnoi

June 25th, 2009 at 11:00 pm

Posted in SQL Server