EXPLAIN EXTENDED

How to create fast database queries

Author Archive

INNER JOIN vs. CROSS APPLY

with 48 comments

From Stack Overflow:

Can anyone give me a good example of when CROSS APPLY makes a difference in those cases where INNER JOIN will work as well?

This is of course SQL Server.

A quick reminder on the terms.

INNER JOIN is the most used construct in SQL: it joins two tables together, selecting only those row combinations for which a JOIN condition is true.

This query:

SELECT  *
FROM    table1
JOIN    table2
ON      table2.b = table1.a

reads:

For each row from table1, select all rows from table2 where the value of field b is equal to that of field a

Note that this condition can be rewritten as this:

SELECT  *
FROM    table1, table2
WHERE   table2.b = table1.a

, in which case it reads as following:

Make a set of all possible combinations of rows from table1 and table2 and of this set select only the rows where the value of field b is equal to that of field a

These conditions are worded differently, but they yield the same result and database systems are aware of that. Usually both these queries are optimized to use the same execution plan.

The former syntax is called ANSI syntax, and it is generally considered more readable and is recommended to use.

However, it didn't make it into Oracle until recently, that's why there are many hardcore Oracle developers that are just used to the latter syntax.

Actually, it's a matter of taste.

To use JOINs (with whatever syntax), both sets you are joining must be self-sufficient, i. e. the sets should not depend on each other. You can query both sets without ever knowing the contents on another set.

But for some tasks the sets are not self-sufficient. For instance, let's consider the following query:

We have table1 and table2. table1 has a column called rowcount.

For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id

We cannot come up with a join condition here. The join condition, should it exist, would involve the row number, which is not present in table2, and there is no way to calculate a row number only from the values of columns of any given row in table2.

That's where the CROSS APPLY can be used.

CROSS APPLY is a Microsoft's extension to SQL, which was originally intended to be used with table-valued functions (TVF's).

The query above would look like this:

SELECT  *
FROM    table1
CROSS APPLY
(
SELECT  TOP (table1.rowcount) *
FROM    table2
ORDER BY
id
) t2

For each from table1, select first table1.rowcount rows from table2 ordered by id

The sets here are not self-sufficient: the query uses values from table1 to define the second set, not to JOIN with it.

The exact contents of t2 are not known until the corresponding row from table1 is selected.

I previously said that there is no way to join these two sets, which is true as long as we consider the sets as is. However, we can change the second set a little so that we get an additional computed field we can later join on.

The first option to do that is just count all preceding rows in a subquery:

SELECT  *
FROM    table1 t1
JOIN    (
        SELECT  t2o.*,
                (
                SELECT  COUNT(*)
                FROM    table2 t2i
                WHERE   t2i.id <= t2o.id
                ) AS rn
        FROM    table2 t2o
        ) t2
ON      t2.rn <= t1.rowcount

The second option is to use a window function, also available in SQL Server since version 2005:

SELECT  *
FROM    table1 t1
JOIN    (
        SELECT  t2o.*, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM    table2 t2o
        ) t2
ON      t2.rn <= t1.rowcount

This function returns the ordinal number a row would have, be the ORDER BY condition used in the function applied to the whole query.

This is essentially the same result as the subquery used in the previous query.

Now, let's create the sample tables and check all these solutions for efficiency:

Read the rest of this entry »

Written by Quassnoi

July 16th, 2009 at 11:00 pm

Posted in SQL Server

Oracle: OR on multiple EXISTS clauses

Comments enabled. I *really* need your comment

From Stack Overflow:

I have two queries, and I want to understand which is better in terms of performance and memory:

SELECT  DISTINCT
        a.no,
        a.id1,
        a.id2
FROM    tbl_b b,
        tbl_a a ,
        tbl_c c,
        tbl_d d
WHERE   (
        b.id1 = a.id1
        AND a.id1 = c.id1
        AND UPPER(c.flag) = 'Y'
        AND c.id1 = d.id1
        )
        OR
        (
        b.id2 = a.id2
        AND a.id2 = c.id2
        AND UPPER(c.flag) = 'Y'
        AND c.id2 = d.id2
        )
        AND d.id3 = 10

and

SELECT  DISTINCT
        a.no,
        a.id1,
        a.id2
FROM    tbl_a a
WHERE   EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_c c
        WHERE   (a.id1 = c.id1 OR a.id2 = c.id2)
                AND UPPER(c.flag) = 'Y'
        )
        AND EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_b b
        WHERE   b.id1 = a.id1 OR b.id2 = a.id2
        )
        AND EXISTS
        (
        SELECT  a.id1, a.id2
        FROM    tbl_d d
        WHERE   (a.id1 = d.id1 or a.id2 = d.id2)
                AND d.id3 = 10
        )

The tables tbl_b and tbl_d are very large tables containing 500,000 to millions of rows, while table tbl_a is relatively small.

My requirement is to pick up only those records from table tbl_a, whose id (either id1 or id2) is available in tbl_b, tbl_c, and tbl_d tables, satisfying certain other conditions as well.

Which is best performance-wise?

We can see that both these queries contain an OR condition, a nightmare for most optimizers.

The first query uses a join on all four tables, concatenating the results and making a distinct set out of them.

The second query check each row in tbl_a, making sure that the corresponding records exists in other tables in one or another way.

These queries are not identical: the first query will select the rows from tbl_a matching all tables on same id (either three matches on id1 or three matches on id2), while the second query returns rows matching on any id

This is, if we have a row matching tbl_b and tbl_c on id1 and tbl_d on id2, this row will be returned by the second query but not the first.

Both these queries will perform poorly on large tables. However, we can improve them.

Let's create the tables, fill them with sample data and make the improvements:
Read the rest of this entry »

Written by Quassnoi

July 15th, 2009 at 11:00 pm

Posted in Oracle

Flattening timespans: PostgreSQL 8.4

with one comment

On Jul 1, 2009, PostgreSQL 8.4 was released.

Among other imporvements, this version supports window functions, recursive queries and common table expressions (CTE's).

Despite being a minor release (accoring to the version numeration), this version can become quite a milestone, since these features make developer's life much, much easier.

Let's check how efficienly these features are implemented.

To do this, I'll take some tasks that I wrote about in the previous blog posts and try to reimplement them using new PostgreSQL features.

I'll start with quite a common task of flattening the intersecting timespans which I wrote about in this article:

This task requires calculating a running maximum and taking a previous record from a recordset, and therefore is a good illustration for window functions.

A quick reminder of the problem, taken from Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiunous and need to merge into one timespan:

date start stop
2009.06.03 05:54:48:000 10:00:13:000
2009.06.03 09:26:45:000 09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date start stop
2009.06.03 10:12:50:000 10:27:25:000
2009.06.03 10:13:12:000 11:14:56:000
2009.06.03 10:27:25:000 10:27:31:000
2009.06.03 10:27:39:000 13:53:38:000
2009.06.03 11:14:56:000 11:15:03:000
2009.06.03 11:15:30:000 14:02:14:000
2009.06.03 13:53:38:000 13:53:43:000
2009.06.03 14:02:14:000 14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 14th, 2009 at 11:00 pm

Posted in PostgreSQL

SQL Server: aggregate bitwise OR

with 2 comments

From Stack Overflow:

I am creating a script for merging and deleting duplicate rows from a table.

The table contains address information, and uses an integer field for storing information about the email as bit flags (column name value). For example, if bit 1 is set in value, that means the record is a primary address.

There are instances of the same email being entered twice, but sometimes with different values. To resolve this, I need to take the value from all duplicates, assign them to one surviving record and delete the rest.

My biggest headache so far as been with the merging of the records. What I want to do is bitwise OR all values of duplicate records together.

From database theory's point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.

However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.

We have three problems here:

  1. Select a first record for each set of duplicates
  2. Update this record with bitwise OR of all values in its set
  3. Delete all other records

Step 1 is easy to do using ROW_NUMBER().

Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.

See this article I wrote some time ago on how to do this:

Now, the main problem is step 2.

SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.

The main idea here is that for bit values, aggregate OR and AND can be replaced with MAX and MIN, accordingly.

All we need is to split each value into the bits, aggregate each bit and merge the results together.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 13th, 2009 at 11:00 pm

Posted in SQL Server

Double-thinking in SQL

with 18 comments

One of the first things a novice SQL developer learns about is called thinking in SQL, which is usually being opposed to procedural thinking

Let's see what part of brain does this intellectual activity take and how to use it.

Two features distinguish SQL from other languages you learned as a 11-year old kid on your first PC, like BASIC or perl or maybe even C++ if you're such a Wunderkind.

First, SQL is set-based. It does things with sets.

Every tool is designed to do things with something else. Like, you use a hammer to do things with nails, or use a screwdriver to do things with screws, or use an oven to do things with food.

Same with computer languages.

BASIC does things with variables. perl does things with scalars, arrays, hashes and file streams. Assembly does things with registers and memory.

You should not be confused by something like registers are just a special case of variables, or a hash is just a generalized container which exposes this and this method or something like that. No.

A hash is a hash, a variable is a variable and a register is a register.

Like, an egg is a food and rice is a food and it's possible to cook some eggs in a rice cooker and vice versa, but they are just wrong tools to do that.

Akihabara district in Tokyo City. You can buy any type of gadget you can imagine, which is about 2% of what you will find there.

Akihabara district in Tokyo City. You can buy any type of gadget you can imagine, which is about 2% of what you will find there.

Prehistoric men had to make do with hammerstones and lithic blades (even to court their women), but now we have a whole district in Tokyo City for gadgets with USB type A, and another district for gadgets with USB type B.

So if you feel the urge to hash something and then make a good old array out of this, you don't use assembly, but perl or PHP instead.

Same with SQL. SQL does things with sets.

It's a tool that allows you to take a dozen or two of sets, mix them together, knead and wedge them then chop them apart and mix again, but the output you get is still a set and all inputs are sets.

Eeverything you do in SQL, you do it on sets. That's why SQL is called a set-oriented language.

Ok, that was the first feature that distinguishes SQL from other languages. What's the second one?

SQL is a declarative language. This means that you express what you want to do with sets, not how you want to do it.

This requires a little explanation.
Read the rest of this entry »

Written by Quassnoi

July 12th, 2009 at 11:00 pm

Posted in Miscellaneous

Selecting compatible articles

Comments enabled. I *really* need your comment

From Stack Overflow:

I need to formulate an SQL query that returns all articles that are compatible to a set of other articles (of arbitrary size).

So for a list of article numbers A, B,… , N the question is:

Give me all articles that are compatible with A and B and … and N

For example, consider the table

A B
1 2
3 1
3 4

If I wanted all articles that are compatible with 1, the query would return (2, 3).

The query generated by the list (2, 3) will return 1, whilst the query generated from the list (1, 3) generates an empty list.

This table describes a friendship: a symmetric irreflexive binary relation.

That is:

  • For any given a, b, if a is a friend to b, then b is a friend to a
  • For any given a, a is never a friend to itself

This relation is heavily used by social networks.

A normalized table describing this relation should be defined like this:

CREATE TABLE t_set (
a INT NOT NULL,
b INT NOT NULL
)
ALTER TABLE t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b)
ALTER TABLE t_set ADD CONSTRAINT ck_set_ab CHECK (a < b) [/sourcecode] , the check being added to account for the relation symmetry. Complete relation can be retrieved with the following query: Read the rest of this entry »

Written by Quassnoi

July 10th, 2009 at 11:00 pm

Posted in SQL Server

Selecting birthdays

Comments enabled. I *really* need your comment

Answering questions asked on the site.

James asks:

I'm developing a forum and want to select all users that have a birthday within the next 3 days.

How do I do it?

This is in SQL Server

This is a very nice feature which every decent forum should have, and I'll be glad to answer this question.

Unfortunately you didn't provide the names of your tables, so I'll have to make them up.

It's not much of a stretch to assume that your table is called t_user and you keep the user's birthdates in a DATETIME field called birthdate.

A birthday within next 3 days means that if you add the person's age to the person's birthdate, you get a date between the current date and three days after it.

To check this, we just need to calculate the number of months between the dates and make sure that it's divisible by 12 with reminder of 0 or 11 (to handle month transitions).

Then we need to add a transition month and divide the number of months by 12. The quotient will give us the number of years we need to add to the birthdate to compare the result with GETDATE().

Let's create a sample table and see how to do it:
Read the rest of this entry »

Written by Quassnoi

July 9th, 2009 at 11:00 pm

Posted in SQL Server

Finding duplicates on either of the fields

Comments enabled. I *really* need your comment

From Stack Overflow:

This query for creating a list of сandidate duplicates is easy enough:

SELECT  COUNT(*), Can_FName, Can_HPhone, Can_EMail
FROM    can 
GROUP BY
        Can_FName, Can_HPhone, Can_EMail
HAVING  COUNT(*) > 1

But if the actual rule I want to check against is FName and (HPhone OR Email) — how can I adjust the GROUP BY to work with this?

This is quite a common need.

The problem with conditions like that is that GROUP BY will not work on them since they are not transitive.

Let's imagine a sample dataset:

id name phone email
1 John 555-00-00 john@example.com
2 John 555-00-01 john@example.com
3 John 555-00-01 john-other@example.com
4 James 555-00-00 james@example.com
5 James 555-00-01 james-other@example.com

We see that records 1 and 2 are duplicates (I'd better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.

But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn't be considered similar if not for the record 2 that binds them.

Given this, we should use similarity chains instead of GROUP BY.

A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).

Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.

This set can be identified by the id of the least node.

We can solve this tasks using recursive CTE's. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 8th, 2009 at 11:00 pm

Posted in SQL Server

Selecting last forum posts

Comments enabled. I *really* need your comment

From Stack Overflow:

I built a custom forum for my site using MySQL.

The listing page is essentially a table with the following columns: Topic, Last Updated, and Replies.

The DB table has the following columns:

id name body date topic_id email

A topic has the topic_id of 0, and replies have the topic_id of their parent topic.

How do I query 20 last updated topics efficiently?

We have two kinds of topics here: those that had been answered and those that had not.

The queries for these two kinds of topics should be optimized using different methods and it's better to query for them using two different queries merged later using UNION ALL:

  • To find last 20 replies, we will employ the technique described in this article:

    This technique uses a composite index on (date, id) to find 20 latest replies to distinct topic_id's:

    1. First, it scans the index on (date, id) to retrieve records with a non-zero topic_id in descending date order.
    2. For each record found on step 1, it scans the index on (date, id) again in descending order and finds the first record with the same topic_id.
    3. If the id's of records found on steps 1 and 2 match, this means that the record found on step 1 is the latest for this topic_id and we should return it.
  • To find last 20 unreplied topics, we just use a good old NOT EXISTS.

    In this case a composite index on (topic_id, date) will come handy: we will need to select the last 20 unreplied topics, and this index will be useful to implement both filtering on topic_id and ordering by date DESC.

  • Finally, we will merge these resultsets using UNION ALL and limit the result.

Now, let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 7th, 2009 at 11:00 pm

Posted in MySQL

Overlapping numbers

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Kat asks:

I want to determine if I have any overlapping ranges of numbers in a table for specific id's.

For example, the table has the following:

ID START_NUMBER END_NUMBER
1 3035112000 3035112010
2 3035112011 3035112020
3 3035112003 3035112005
4 3032925000 3032925010
1 3035113000 3035113050
2 3035113000 3035113050
3 3035113051 3035113052
4 3032925500 3032925550

I only want to pull those that have overlapping number ranges for id's 1, 2, and 3 (but not 4).

This is in Oracle.

We need to write a query to pull out all id's of rows that overlap at least one other row in the table.

Let's create a pair of sample tables:
Read the rest of this entry »

Written by Quassnoi

July 6th, 2009 at 11:00 pm

Posted in Oracle