EXPLAIN EXTENDED

How to create fast database queries

Archive for July, 2009

Finding duplicates on either of the fields

Comments enabled. I *really* need your comment

From Stack Overflow:

This query for creating a list of сandidate duplicates is easy enough:

SELECT  COUNT(*), Can_FName, Can_HPhone, Can_EMail
FROM    can 
GROUP BY
        Can_FName, Can_HPhone, Can_EMail
HAVING  COUNT(*) > 1

But if the actual rule I want to check against is FName and (HPhone OR Email) — how can I adjust the GROUP BY to work with this?

This is quite a common need.

The problem with conditions like that is that GROUP BY will not work on them since they are not transitive.

Let's imagine a sample dataset:

id name phone email
1 John 555-00-00 john@example.com
2 John 555-00-01 john@example.com
3 John 555-00-01 john-other@example.com
4 James 555-00-00 james@example.com
5 James 555-00-01 james-other@example.com

We see that records 1 and 2 are duplicates (I'd better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.

But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn't be considered similar if not for the record 2 that binds them.

Given this, we should use similarity chains instead of GROUP BY.

A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).

Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.

This set can be identified by the id of the least node.

We can solve this tasks using recursive CTE's. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 8th, 2009 at 11:00 pm

Posted in SQL Server

Selecting last forum posts

Comments enabled. I *really* need your comment

From Stack Overflow:

I built a custom forum for my site using MySQL.

The listing page is essentially a table with the following columns: Topic, Last Updated, and Replies.

The DB table has the following columns:

id name body date topic_id email

A topic has the topic_id of 0, and replies have the topic_id of their parent topic.

How do I query 20 last updated topics efficiently?

We have two kinds of topics here: those that had been answered and those that had not.

The queries for these two kinds of topics should be optimized using different methods and it's better to query for them using two different queries merged later using UNION ALL:

  • To find last 20 replies, we will employ the technique described in this article:

    This technique uses a composite index on (date, id) to find 20 latest replies to distinct topic_id's:

    1. First, it scans the index on (date, id) to retrieve records with a non-zero topic_id in descending date order.
    2. For each record found on step 1, it scans the index on (date, id) again in descending order and finds the first record with the same topic_id.
    3. If the id's of records found on steps 1 and 2 match, this means that the record found on step 1 is the latest for this topic_id and we should return it.
  • To find last 20 unreplied topics, we just use a good old NOT EXISTS.

    In this case a composite index on (topic_id, date) will come handy: we will need to select the last 20 unreplied topics, and this index will be useful to implement both filtering on topic_id and ordering by date DESC.

  • Finally, we will merge these resultsets using UNION ALL and limit the result.

Now, let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 7th, 2009 at 11:00 pm

Posted in MySQL

Overlapping numbers

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Kat asks:

I want to determine if I have any overlapping ranges of numbers in a table for specific id's.

For example, the table has the following:

ID START_NUMBER END_NUMBER
1 3035112000 3035112010
2 3035112011 3035112020
3 3035112003 3035112005
4 3032925000 3032925010
1 3035113000 3035113050
2 3035113000 3035113050
3 3035113051 3035113052
4 3032925500 3032925550

I only want to pull those that have overlapping number ranges for id's 1, 2, and 3 (but not 4).

This is in Oracle.

We need to write a query to pull out all id's of rows that overlap at least one other row in the table.

Let's create a pair of sample tables:
Read the rest of this entry »

Written by Quassnoi

July 6th, 2009 at 11:00 pm

Posted in Oracle

Schema junk

with one comment

Many tasks in SQL world can be efficiently performed using stored procedures.

All database management systems of the Big Four (that is Oracle, SQL Server, MySQL and PostgreSQL) have a concept of a stored procedure: a little server-managed program written in a procedural language that allows running SQL statements and returning results.

They are widely recommended for use, since they allow easy separation of business rules from data storage rules, fine-grained security, integrity control and all other things software analysts mumble for their living.

People widely accept this theory and if you ask anyone should I use the stored procedures?, the answer you will get will be a unanimous yes.

The problem is that people rarely use them in fact in real applications. Everybody knows they should but nobody really does.

More than that: despite the fact that stored procedures in fact can be more efficient than plain queries (even for one-time tasks), people often prefer the latter. Writing a procedure to obtain a resultset seems to be an overkill from the point of view of most developers, they rather prefer doing it with a plain SQL query.

Why so?

Let me tell you another boring story. Don't worry, it won't take long.
Read the rest of this entry »

Written by Quassnoi

July 5th, 2009 at 11:00 pm

Posted in Miscellaneous

Searching for NULL columns

Comments enabled. I *really* need your comment

From Stack Overflow:

I want to make sure I'm not inserting a duplicate row into my table (i. e. only PRIMARY KEY different).

All my fields allow NULL's as I've decided NULL to mean all values.

Because of NULL's, the following statement in my stored procedure can't work:

IF EXISTS
        (
        SELECT  *
        FROM    MY_TABLE
        WHERE   MY_FIELD1 = @IN_MY_FIELD1
                AND MY_FIELD2 = @IN_MY_FIELD2
                AND MY_FIELD3 = @IN_MY_FIELD3 
                AND MY_FIELD4 = @IN_MY_FIELD4
                AND MY_FIELD5 = @IN_MY_FIELD5
                AND MY_FIELD6 = @IN_MY_FIELD6
        )
BEGIN
        goto on_duplicate
END

, since NULL = NULL is not true.

How can I check for the duplicates without having an IF ISNULL statement for every column?

First of all, can't help being a smartass and tell that a UNIQUE index should be defined over these columns.

SQL Server does index NULL values and they are first class values as long as uniqueness is concerned, so creating this index will manage the duplicates automatically.

Now, back to business. Of course there are cases when creating such an index is not acceptable.

In these cases a very nice but little known operator INTERSECT comes handy.

This operator takes two resultsets with the same layout and finds rows common to them (eliminatnig duplicates), as SQL Server's understands common: each value in each column should be either equal to one in the other row or they both should be NULL's.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 4th, 2009 at 11:00 pm

Posted in SQL Server

Overlapping ranges: PostgreSQL

Comments enabled. I *really* need your comment

Continuing the theme on overlapping ranges:

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Finally, PostgreSQL.

A quick reminder: we have two options to write a condition for such a query.

A more simple one:

start_date < @week_end AND end_date > @week_start

is, uh, more simple, but less efficient in SQL Server and MySQL.

A more complex one:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

is more index friendly and is more performant in SQL Server, Oracle and MySQL (with a slight schema change to enable SPATIAL index support)

Now, let's create a sample table in PostgreSQL:
Read the rest of this entry »

Written by Quassnoi

July 3rd, 2009 at 11:00 pm

Posted in PostgreSQL

Overlapping ranges: Oracle

Comments enabled. I *really* need your comment

Continuing the theme on overlapping ranges:

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Today, let's see it in Oracle.

A quick reminder: we have two options to write a condition for such a query.

A more simple one:

start_date < @week_end AND end_date > @week_start

is, uh, more simple, but less efficient in SQL Server and MySQL.

A more complex one:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

is more index friendly.

Let's create a sample table and see how Oracle copes with these conditions:
Read the rest of this entry »

Written by Quassnoi

July 2nd, 2009 at 11:00 pm

Posted in Oracle

Overlapping ranges: MySQL

with one comment

Continuing yesterday's post.

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

Now, let's see how do it in MySQL.

In the previous post we saw that SQL Server is not very good in optimizing this construct:

start_date < @week_end AND end_date > @week_start

and using it leads to a full table scan. A more complex condition is more efficient in SQL Server:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

, since two different indexes can be applied to both parts of the condition.

However, the second condition is intrinsically not very efficient which is a problem for almost any datetime range analysis.

A plain B-Tree index is efficient for queries like this:

constant1 < column < constant1

, but not for the queries like this:

column1 < constant < column2

The latter condition is in fact rewritten as following by the optimizer:

column1 < constant AND column2 > constant

and the optimizer is forced to choose between two non-optimal options: either stick to one index and do the filtering (using table lookups) or perform two index lookups and merge the results.

If 1,000,000 rows satisfy the first condition and 1,000,000 satisfy the second one, the engine needs to inspect either 1,000,000 rows with table lookups or 2,000,000 rows using indexes, only to find out that the intersection contains but 10 rows.

Fortunately, MySQL supports R-Tree indexes (called SPATIAL in MySQL because they are primarily used to query spatial data).

Indexes of this kind are particularly good for the queries similar to the one above, and with a little effort they can be used to query time data too.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 1st, 2009 at 11:00 pm

Posted in MySQL