EXPLAIN EXTENDED

How to create fast database queries

Archive for June, 2009

Overlapping ranges: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

This query needs to find all events that overlap the week.

The condition for overlapping ranges is well known and quite simple:

start_date < @week_end AND end_date > @week_start

However, in SQL, simpler doesn't always mean more efficient, and the following condition, despite being more complex, can yield better performance:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
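
As a sketch (assuming a hypothetical events table with an index on start_date), the week query built on the second condition could look like this; the first branch is a plain range on start_date that the index can serve:

```sql
-- Hypothetical table: events (start_date, end_date, ...), indexed on start_date.
-- The two branches together cover every overlap case: events starting inside
-- the week, and events that started earlier and are still running.
SELECT  *
FROM    events e
WHERE   (e.start_date > @week_start AND e.start_date < @week_end)
        OR @week_start BETWEEN e.start_date AND e.end_date
```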

Let's create a sample table and see how:
Read the rest of this entry »

Written by Quassnoi

June 30th, 2009 at 11:00 pm

Posted in SQL Server

The Island of Misfit Cursors

with 3 comments

When I was in the elementary school, I once visited my friend.

His mother greeted us, invited us inside and told us to wash our hands and go to the kitchen. Then she turned on a giant electrical samovar to make us some tea.

When it boiled, she switched it off and my friend took the teapot to fill it with hot water. Wait, said his mother. Didn't you forget something? Sorry, mum, he said, then took a long steel spoon and stirred the water inside the samovar with it.

What's that you're doing? I asked. His mom replied: It's good for the samovar. It will live longer if you stir the water with a metal spoon after it has boiled, the manual says so.

Never knew about it, but OK, good to know.

About a year later my parents and I went to visit my grandmother. We kissed hello and she put exactly the same electrical samovar on to boil. When it boiled, I turned it off and took a spoon to stir the water. Wait! the granny shouted. What are you going to do, kill yourself? You don't put metal things inside electric appliances!

I was quite confused. My friend's mom had advised doing this, and the friend did it, and everything was OK. On the other hand, granny's statement also made sense: metal is a bad thing to have between you and electricity.

I was 10 or so at the time, so I quickly forgot about it.

But when the grandma died, she left some of her belongings to me. I went back to her old house and found the box with the samovar. I recalled the spoon and got curious. The box had a manual inside. I opened the page with the warnings and among other things found the following: With this new model, it is safe to stir water using metal objects.

As you may know, people never read manuals, and when they do, they get them wrong.
Read the rest of this entry »

Written by Quassnoi

June 28th, 2009 at 11:00 pm

Posted in Miscellaneous

Efficient INSERT ON DUPLICATE KEY UPDATE

Comments enabled. I *really* need your comment

From Stack Overflow:

I'm implementing a memory cache for a table in MySQL that looks like this:

item1 item2 cnt

The original table includes millions of pairs like this and updates rapidly.

To make it all more efficient I want to write new pairs to an identical memory table and update the real table on disk periodically by cron.

The cron should do the following: for each pair, if there is a similar pair in the non-memory table, increase the count by the count from the memory table.

If no such pair exists, create it with count from the memory table.

How can I make the flush (from memory table to real table) most efficient?

This is of course best achieved with INSERT ON DUPLICATE KEY UPDATE.

It's the MySQL flavor of the INSERT statement that automatically handles key violations by updating the conflicting row instead of inserting a new one.

If we provide a SELECT statement as a row source for INSERT ON DUPLICATE KEY UPDATE, multiple input rows may cause the key violation on the target row. In this case, the target row will be updated many times.

To achieve the desired result, we have two options:

  1. Use the memory table as a row source for INSERT ON DUPLICATE KEY UPDATE as is. The target rows will be updated multiple times.
  2. Aggregate the counters in the memory table and use the aggregated values as the row source. The target rows will be updated at most once.

The first solution will update the target rows many times (which can be slow), while the second one incurs some GROUP BY overhead.
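
The second option could be sketched like this (table names are hypothetical, following the question's item1 / item2 / cnt layout):

```sql
-- Aggregate the memory table first so each (item1, item2) pair appears once,
-- then let ON DUPLICATE KEY UPDATE add the delta to the existing counter.
-- VALUES(cnt) refers to the value that would have been inserted.
INSERT
INTO    counts_disk (item1, item2, cnt)
SELECT  item1, item2, SUM(cnt)
FROM    counts_mem
GROUP BY
        item1, item2
ON DUPLICATE KEY
UPDATE  cnt = cnt + VALUES(cnt);
```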

Let's create the sample tables and see which one is more efficient:
Read the rest of this entry »

Written by Quassnoi

June 27th, 2009 at 11:00 pm

Posted in MySQL

Included columns

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Peter asks:

I've been looking around for explanation of how INCLUDED columns in SQL Server indexes work and found myself totally confused.

Could you please explain what they are for?

Thanks!

In a couple of words, an INCLUDED column is a column which is included in the index payload but not in the key, i.e. the index leaves are not sorted on this column.

This can save some DML overhead for columns which are only filtered on non-range conditions.

If the index leaves are smaller than the table rows, it can be more efficient to fetch the column value from the index rather than from the table in order to filter on it.

This can help to get rid of expensive Key Lookups or Clustered Index Scans.
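
A minimal sketch of the syntax (hypothetical names): the key column keeps the index sorted, while the INCLUDE list just rides along in the leaves:

```sql
-- Key on the order date (range filter); payload carries client id and price,
-- so the report query below can be answered from the index alone,
-- with no key lookups into the wide table rows.
CREATE INDEX IX_orders_date ON orders (order_date)
INCLUDE (client_id, price);

SELECT  AVG(price)
FROM    orders
WHERE   order_date >= '20090601'
        AND order_date < '20090701'
        AND client_id IN (1, 3, 5);
```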

In one of the applications I developed, I had a large table that kept the restaurant orders.

The table was very large both in row count and the row size: a little less than a hundred columns.

Among other fields there was a DATETIME column (the order date) and an INT column (the client id).

One of the most used reports required that I calculated the average price of the orders for a certain set of clients within a certain period, say, for June.

For the sake of brevity, I'll replace all these numerous columns with a single NVARCHAR(4000) column.

This will make the table definition briefer and more comprehensible, while leaving the table rows just as large.

Here's what the table looked like:
Read the rest of this entry »

Written by Quassnoi

June 26th, 2009 at 11:00 pm

Posted in SQL Server

Building permutations: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I am doing a testing scenario.

There are 6 people in each site:

Site People
Site 1 A, B, C, D, E, F
Site 2 G, H, I, J, K, L
Site 3 M, N, O, P, Q, R
Site 4 S, T, U, V, W, X

I want to write a query that suggests, date-wise, the people who can test the site, two at a time.

Here are the rules:

  1. If a person has tested his site on Day 1, then his turn should come on Day 4 and not before that.
This, however, is applicable for the current week. So if A and D test a site on the 22nd, B and E test it on the 23rd and C and F test on the 24th, then for this week A and D can test the site again only on the 25th. Sunday is a holiday.
  2. Every week, the pair should change. Rule 1 is still applicable with the new pair.
  3. A person belonging to a particular site cannot test other sites.

To build these rules, we need to do the following:

  1. For each site, split 6 people into 3 pairs.
  2. Assign each pair to the appropriate day of week (pair 1 for Monday and Thursday, pair 2 for Tuesday and Friday, pair 3 for Wednesday and Saturday)
  3. For each new week, split the 6 people a new way so that no pairs persist.

Step 3 is the most interesting here, since it requires building permutations.

To build the pairs, we need to arrange 6 people and take the pairs from the arrangements.

There are 6! = 720 distinct ways to arrange 6 people.

But for our task the following pairs:

  • (1, 2) and (2, 1)
  • (3, 4) and (4, 3)
  • (5, 6) and (6, 5)

are considered equal. Since the order within each of the three pairs is irrelevant (2 × 2 × 2 = 8 equivalent arrangements), we can use only 720 / 8 = 90 arrangements, which yield non-repeating pairs.

SQL Server does not provide a native way to build all permutations of a set, so we will have to emulate it using recursive CTEs.
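
A sketch of the idea (not necessarily the article's exact query): a recursive CTE appends one as-yet-unused element per step and keeps the strings of length 6, producing all 6! = 720 arrangements:

```sql
-- Build every arrangement of the digits 1..6 as a string.
-- Each recursion step appends a digit not already present in the string.
WITH    nums (num) AS
        (
        SELECT  1 UNION ALL SELECT 2 UNION ALL SELECT 3
        UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
        ),
        perms (perm, len) AS
        (
        SELECT  CAST(num AS VARCHAR(6)), 1
        FROM    nums
        UNION ALL
        SELECT  CAST(p.perm + CAST(n.num AS VARCHAR(1)) AS VARCHAR(6)), p.len + 1
        FROM    perms p
        JOIN    nums n
        ON      CHARINDEX(CAST(n.num AS VARCHAR(1)), p.perm) = 0
        )
SELECT  perm
FROM    perms
WHERE   len = 6
```

Filtering these 720 strings down to the 90 that yield distinct sets of pairs is then a matter of fixing the order inside each pair of positions.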

Let's create a table with required data:
Read the rest of this entry »

Written by Quassnoi

June 25th, 2009 at 11:00 pm

Posted in SQL Server

Constant vs. NULL to mark missing values in OUTER JOINs

Comments enabled. I *really* need your comment

From Stack Overflow (shortened for clarity):

I have two tables: homes and home_photos:

It's very possible that a home does not have a photo associated with the home.

In that case, the column primary_photo_group_id will have value 0.

Otherwise, primary_photo_group_id equals the group_id of the photo.

How do I return a NULL if the photo is not defined for a home?

This is of course a classic task for an OUTER JOIN:

SELECT  *
FROM    homes h
LEFT JOIN
        photos p
ON      p.home_photo_group_id = h.primary_photo_group_id
        AND …

This construct, however, can be optimized, and I'll show how.

Let's create and fill the sample tables:
Read the rest of this entry »

Written by Quassnoi

June 24th, 2009 at 11:00 pm

Posted in MySQL

IN with a comma separated list: SQL Server

with 3 comments

From Stack Overflow:

How do I search a table for all values from a comma separated list, or return all values if the list contains a single 0?

Using comma separated lists in SQL Server as a search parameter is generally considered a bad design and should be avoided.

One should pass lists in a table variable rather than in a set of comma separated values.

However, there are legacy applications, ORM's and these kinds of stuff that just leave you no choice.

MySQL provides a nice function find_in_set designed exactly for this task.

SQL Server does not directly support an IN predicate against a CSV string, nor does it provide find_in_set: the IN predicate needs a rowset to work.

To expand a comma separated list into a rowset, we will use a recursive CTE.

In the CTE, we should search for the position of every next comma and return it.

Then we can just take the substring between two commas and cast it into an INT:

DECLARE @lid VARCHAR(100)
SET @lid = '3, 14, 15, 296, 5358';
WITH    cd AS
        (
        SELECT  1 AS first, CHARINDEX(',', @lid, 1) AS next
        UNION ALL
        SELECT  next + 1, CHARINDEX(',', @lid, next + 1)
        FROM    cd
        WHERE   next > 0
        ),
        lid AS
        (
        SELECT  CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first) AS INT) AS id
        FROM    cd
        )
SELECT  *
FROM    lid

id
3
14
15
296
5358

Now we just need to return the values from the table found in this list, or all values if there is a single 0 in the list.
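
One possible shape for that final step (a sketch, not necessarily the article's query; mytable is a hypothetical name), reusing the lid rowset built above:

```sql
-- Return rows whose id is in the list, or everything when the list is a lone 0.
SELECT  m.*
FROM    mytable m
WHERE   m.id IN
        (
        SELECT  id
        FROM    lid
        )
        OR EXISTS
        (
        SELECT  NULL
        FROM    lid
        HAVING  COUNT(*) = 1 AND MIN(id) = 0
        )
```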

Let's create a sample table and build a query to do this:
Read the rest of this entry »

Written by Quassnoi

June 23rd, 2009 at 11:00 pm

Posted in SQL Server

Indexing tiny tables

with one comment

From Stack Overflow:

Hypothetically, in a SQL Server database, if I have a table with two INT fields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the two INT fields overcomes the overhead imposed by said indexes?

Very nice question.

Imagine we have a table table1 (col INT NOT NULL, value1 INT NOT NULL, value2 VARCHAR(100) NOT NULL, …) with an index on col.

In SQL Server, an index can improve the following things.

  1. Index lookups (range scans). An index can be used to limit the number of rows scanned:

    SELECT  *
    FROM    table1 t1
    WHERE   t1.col BETWEEN @start AND @end
    
  2. Secondary data source. This query will get all the data it needs from the index:
    SELECT  t1.col
    FROM    table1 t1
    
  3. Ordered rowsets. This query will not need to sort:
    SELECT  *
    FROM    table1 t1
    ORDER BY
            t1.col
    
  4. Efficient MIN / MAX:
    SELECT  MIN(col)
    FROM    table1 t1
    

The latter case is probably the best illustration of the fact that even a table of 2 rows can benefit from creating an index.

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

June 22nd, 2009 at 11:00 pm

Posted in SQL Server

Click

with 2 comments

Back in the late '90s, I was employed by a small startup founded to develop a restaurant management system.

The job listing they posted said PostgreSQL wanted (sic). And PostgreSQL for sure was something I knew about. You see, I already had experience with PostgreSQL by that time: I was able to install PostgreSQL, create a database, run a query against it and even create a trigger in C (I didn't yet know of PL/pgSQL).

So I applied for the job, got hired and started coding.

Coding what, you may ask?

Well, I had no idea. And in fact, nobody did. There were only two requirements for the system.

First, it had to work somehow with a Casio Cassiopeia, which allowed a wireless network module to be installed and used. A waiter had to be able to take your order and it had to get to the kitchen instantly.

The second one was personal. The founders of the company (and me too) used to dine in a small cafe in the basement of the office building. They had a menu, but it didn't correlate with reality. You literally couldn't get anything that was in the menu. They were constantly out of meat, out of tomatoes, out of potatoes (potatoes!). Instead, they put a small hand-written piece of paper into the menu which listed everything they had at the time (a soup and something else), that's if they were in a good mood. If they were not, you just placed an order, and the waiter came back in 10 minutes to tell you it was beyond their ability to get some tomatoes and cucumbers to chop you some salad, try ordering something else, thank you.

In our system, that was never ever going to happen. Never. If the kitchen was short of something needed to cook a dish you ordered, the system would warn the waiter instantly, and she would tell you to order something else right that minute, instead of going to the kitchen and back.

It had to be a killer feature.
Read the rest of this entry »

Written by Quassnoi

June 21st, 2009 at 11:00 pm

Posted in Miscellaneous

NVL vs. COALESCE

with 2 comments

From Stack Overflow:

Are there non-obvious differences between NVL and COALESCE in Oracle?

The obvious differences are that COALESCE will return the first non-NULL item in its parameter list whereas NVL only takes two parameters and returns the first if it is not NULL, otherwise it returns the second.

It seems that NVL may just be a base case version of COALESCE.

Am I missing something?

In the case of two arguments of the same type, these functions are basically the same: they both return the first non-NULL value.

However, they are implemented differently:

  • NVL always evaluates both arguments
  • COALESCE doesn't evaluate anything after it finds the first non-NULL argument

This behavior can be utilized.
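
For instance (hypothetical names), an expensive fallback lookup can be guarded by COALESCE, which only runs the subquery when the first argument is NULL, whereas NVL would evaluate it for every row:

```sql
-- Oracle: special_price is filled only for special orders; for plain menu
-- items, COALESCE falls through to the (more expensive) menu price lookup.
SELECT  COALESCE(
                oi.special_price,
                (
                SELECT  mp.price
                FROM    menu_prices mp
                WHERE   mp.item_id = oi.item_id
                )
        ) AS price
FROM    order_items oi;
```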

Here's a real world example from a restaurant management system I developed about 10 years ago.

A bill that is presented to a customer consists of the items he or she ordered.

The majority of the dishes ordered are plain menu items: something that has its price printed in the menu.

However, there are some special items which are not included in the menu: a customer may want to order something exclusive and pay for it.

The manager should estimate the cost of ingredients, time and effort to make this dish, then calculate the price and name it: everything in 30 seconds.

These things of course come overpriced, but the customers who make such orders don't seem to care much about money.

In the bill, that appears as a special order. When a manager enters it into the computer, she also enters the price she has calculated.

These prices then get back to the server and are stored in a table (not the one where the regular menu prices are stored).

At the end of the month, a report on the month's receipts should be generated.

Here are the tables containing the data relevant to the reports. They are of course simplified, but the principle remains the same.
Read the rest of this entry »

Written by Quassnoi

June 20th, 2009 at 11:00 pm

Posted in Oracle