EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘SQL Server’ Category

INNER JOIN vs. CROSS APPLY

with 48 comments

From Stack Overflow:

Can anyone give me a good example of when CROSS APPLY makes a difference in those cases where INNER JOIN will work as well?

This is of course SQL Server.

A quick reminder on the terms.

INNER JOIN is the most used construct in SQL: it joins two tables together, selecting only those row combinations for which a JOIN condition is true.

This query:

SELECT  *
FROM    table1
JOIN    table2
ON      table2.b = table1.a

reads:

For each row from table1, select all rows from table2 where the value of field b is equal to that of field a

Note that this condition can be rewritten as this:

SELECT  *
FROM    table1, table2
WHERE   table2.b = table1.a

, in which case it reads as following:

Make a set of all possible combinations of rows from table1 and table2 and of this set select only the rows where the value of field b is equal to that of field a

These conditions are worded differently, but they yield the same result and database systems are aware of that. Usually both these queries are optimized to use the same execution plan.

The former syntax is called ANSI syntax, and it is generally considered more readable and is recommended to use.

However, it didn't make it into Oracle until recently, that's why there are many hardcore Oracle developers that are just used to the latter syntax.

Actually, it's a matter of taste.

To use JOINs (with whatever syntax), both sets you are joining must be self-sufficient, i. e. the sets should not depend on each other. You can query both sets without ever knowing the contents on another set.

But for some tasks the sets are not self-sufficient. For instance, let's consider the following query:

We have table1 and table2. table1 has a column called rowcount.

For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id

We cannot come up with a join condition here. The join condition, should it exist, would involve the row number, which is not present in table2, and there is no way to calculate a row number only from the values of columns of any given row in table2.

That's where the CROSS APPLY can be used.

CROSS APPLY is a Microsoft's extension to SQL, which was originally intended to be used with table-valued functions (TVF's).

The query above would look like this:

SELECT  *
FROM    table1
CROSS APPLY
(
SELECT  TOP (table1.rowcount) *
FROM    table2
ORDER BY
id
) t2

For each from table1, select first table1.rowcount rows from table2 ordered by id

The sets here are not self-sufficient: the query uses values from table1 to define the second set, not to JOIN with it.

The exact contents of t2 are not known until the corresponding row from table1 is selected.

I previously said that there is no way to join these two sets, which is true as long as we consider the sets as is. However, we can change the second set a little so that we get an additional computed field we can later join on.

The first option to do that is just count all preceding rows in a subquery:

SELECT  *
FROM    table1 t1
JOIN    (
        SELECT  t2o.*,
                (
                SELECT  COUNT(*)
                FROM    table2 t2i
                WHERE   t2i.id <= t2o.id
                ) AS rn
        FROM    table2 t2o
        ) t2
ON      t2.rn <= t1.rowcount

The second option is to use a window function, also available in SQL Server since version 2005:

SELECT  *
FROM    table1 t1
JOIN    (
        SELECT  t2o.*, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM    table2 t2o
        ) t2
ON      t2.rn <= t1.rowcount

This function returns the ordinal number a row would have, be the ORDER BY condition used in the function applied to the whole query.

This is essentially the same result as the subquery used in the previous query.

Now, let's create the sample tables and check all these solutions for efficiency:

Read the rest of this entry »

Written by Quassnoi

July 16th, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: aggregate bitwise OR

with 2 comments

From Stack Overflow:

I am creating a script for merging and deleting duplicate rows from a table.

The table contains address information, and uses an integer field for storing information about the email as bit flags (column name value). For example, if bit 1 is set in value, that means the record is a primary address.

There are instances of the same email being entered twice, but sometimes with different values. To resolve this, I need to take the value from all duplicates, assign them to one surviving record and delete the rest.

My biggest headache so far as been with the merging of the records. What I want to do is bitwise OR all values of duplicate records together.

From database theory's point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.

However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.

We have three problems here:

  1. Select a first record for each set of duplicates
  2. Update this record with bitwise OR of all values in its set
  3. Delete all other records

Step 1 is easy to do using ROW_NUMBER().

Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.

See this article I wrote some time ago on how to do this:

Now, the main problem is step 2.

SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.

The main idea here is that for bit values, aggregate OR and AND can be replaced with MAX and MIN, accordingly.

All we need is to split each value into the bits, aggregate each bit and merge the results together.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 13th, 2009 at 11:00 pm

Posted in SQL Server

Selecting compatible articles

Comments enabled. I *really* need your comment

From Stack Overflow:

I need to formulate an SQL query that returns all articles that are compatible to a set of other articles (of arbitrary size).

So for a list of article numbers A, B,… , N the question is:

Give me all articles that are compatible with A and B and … and N

For example, consider the table

A B
1 2
3 1
3 4

If I wanted all articles that are compatible with 1, the query would return (2, 3).

The query generated by the list (2, 3) will return 1, whilst the query generated from the list (1, 3) generates an empty list.

This table describes a friendship: a symmetric irreflexive binary relation.

That is:

  • For any given a, b, if a is a friend to b, then b is a friend to a
  • For any given a, a is never a friend to itself

This relation is heavily used by social networks.

A normalized table describing this relation should be defined like this:

CREATE TABLE t_set (
a INT NOT NULL,
b INT NOT NULL
)
ALTER TABLE t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b)
ALTER TABLE t_set ADD CONSTRAINT ck_set_ab CHECK (a < b) [/sourcecode] , the check being added to account for the relation symmetry. Complete relation can be retrieved with the following query: Read the rest of this entry »

Written by Quassnoi

July 10th, 2009 at 11:00 pm

Posted in SQL Server

Selecting birthdays

Comments enabled. I *really* need your comment

Answering questions asked on the site.

James asks:

I'm developing a forum and want to select all users that have a birthday within the next 3 days.

How do I do it?

This is in SQL Server

This is a very nice feature which every decent forum should have, and I'll be glad to answer this question.

Unfortunately you didn't provide the names of your tables, so I'll have to make them up.

It's not much of a stretch to assume that your table is called t_user and you keep the user's birthdates in a DATETIME field called birthdate.

A birthday within next 3 days means that if you add the person's age to the person's birthdate, you get a date between the current date and three days after it.

To check this, we just need to calculate the number of months between the dates and make sure that it's divisible by 12 with reminder of 0 or 11 (to handle month transitions).

Then we need to add a transition month and divide the number of months by 12. The quotient will give us the number of years we need to add to the birthdate to compare the result with GETDATE().

Let's create a sample table and see how to do it:
Read the rest of this entry »

Written by Quassnoi

July 9th, 2009 at 11:00 pm

Posted in SQL Server

Finding duplicates on either of the fields

Comments enabled. I *really* need your comment

From Stack Overflow:

This query for creating a list of сandidate duplicates is easy enough:

SELECT  COUNT(*), Can_FName, Can_HPhone, Can_EMail
FROM    can 
GROUP BY
        Can_FName, Can_HPhone, Can_EMail
HAVING  COUNT(*) > 1

But if the actual rule I want to check against is FName and (HPhone OR Email) — how can I adjust the GROUP BY to work with this?

This is quite a common need.

The problem with conditions like that is that GROUP BY will not work on them since they are not transitive.

Let's imagine a sample dataset:

id name phone email
1 John 555-00-00 john@example.com
2 John 555-00-01 john@example.com
3 John 555-00-01 john-other@example.com
4 James 555-00-00 james@example.com
5 James 555-00-01 james-other@example.com

We see that records 1 and 2 are duplicates (I'd better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.

But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn't be considered similar if not for the record 2 that binds them.

Given this, we should use similarity chains instead of GROUP BY.

A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).

Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.

This set can be identified by the id of the least node.

We can solve this tasks using recursive CTE's. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 8th, 2009 at 11:00 pm

Posted in SQL Server

Searching for NULL columns

Comments enabled. I *really* need your comment

From Stack Overflow:

I want to make sure I'm not inserting a duplicate row into my table (i. e. only PRIMARY KEY different).

All my fields allow NULL's as I've decided NULL to mean all values.

Because of NULL's, the following statement in my stored procedure can't work:

IF EXISTS
        (
        SELECT  *
        FROM    MY_TABLE
        WHERE   MY_FIELD1 = @IN_MY_FIELD1
                AND MY_FIELD2 = @IN_MY_FIELD2
                AND MY_FIELD3 = @IN_MY_FIELD3 
                AND MY_FIELD4 = @IN_MY_FIELD4
                AND MY_FIELD5 = @IN_MY_FIELD5
                AND MY_FIELD6 = @IN_MY_FIELD6
        )
BEGIN
        goto on_duplicate
END

, since NULL = NULL is not true.

How can I check for the duplicates without having an IF ISNULL statement for every column?

First of all, can't help being a smartass and tell that a UNIQUE index should be defined over these columns.

SQL Server does index NULL values and they are first class values as long as uniqueness is concerned, so creating this index will manage the duplicates automatically.

Now, back to business. Of course there are cases when creating such an index is not acceptable.

In these cases a very nice but little known operator INTERSECT comes handy.

This operator takes two resultsets with the same layout and finds rows common to them (eliminatnig duplicates), as SQL Server's understands common: each value in each column should be either equal to one in the other row or they both should be NULL's.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 4th, 2009 at 11:00 pm

Posted in SQL Server

Overlapping ranges: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I have an event table that specifies a date range with start_date and end_date fields.

I have another date range, specified in code, that defines the current week as week_start and week_end.

I'd like to query all events for the week.

This query needs to find all events that overlap the week.

The condition for overlapping ranges is well known and quite simple:

start_date < @week_end AND end_date > @week_start

However, in SQL more simple doesn't always mean more efficient, and the following condition, despite being more complex, can yield better performance:


(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)

Let's create a sample table and see how so:
Read the rest of this entry »

Written by Quassnoi

June 30th, 2009 at 11:00 pm

Posted in SQL Server

Included columns

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Peter asks:

I've been looking around for explanation of how INCLUDED columns in SQL Server indexes work and found myself totally confused.

Could you please explain what they are for?

Thanks!

In a couple of words, an INCLUDED column is a column which is included into the index payload but not the key, i. e. the index leaves are not sorted on this column.

This can save some DML overhead on columns which are filtered on non-ranged conditions.

If the index leaves are less in size than the table rows, it can be more efficient to get the column value from the index rather than from the table to filter on it.

This can help to get rid of expensive Key Lookups or Clustered Index Scans.

In one of applications I developed, I had a large table that kept the restaurant orders.

The table was very large both in row count and the row size: a little less than a hundred columns.

Among other fields there was a DATETIME column (the order date) and an INT column (the client id)

One of the most used reports required that I calculated the average price of the orders for a certain set of clients within a certain period, say, for June.

For the sake of brevity, I'll replace all these numerous columns with a single NVARCHAR(4000) column.

This will make the table definition more brief and comprehensible, leaving the table rows as large.

Here's what the table looked like:
Read the rest of this entry »

Written by Quassnoi

June 26th, 2009 at 11:00 pm

Posted in SQL Server

Building permutations: SQL Server

Comments enabled. I *really* need your comment

From Stack Overflow:

I am doing a testing scenario.

There are 6 people in each site:

Site People
Site 1 A, B, C, D, E, F
Site 2 G, H, I, J, K, L
Site 3 M, N, O, P, Q, R
Site 4 S, T, U, V, W, X

I want to write a query that can suggests me datewise the people who can test the site, two at a time.

Here are the rules:

  1. If a person has tested his site on Day 1, then his turn should come on Day 4 and not before that.
    This, however, is applicable for current week. So if A and D test a site on 22nd, B and E test it on 23rd and C and F test on 24th, then for this week, A and D can test the site only on 25th. Sunday is a holiday.
  2. Every week, the pair should change. Rule 1 is still applicable with the new pair.
  3. A person belonging to a particular site cannot test other sites.

To build these rules, we need to do the following:

  1. For each site, split 6 people into 3 pairs.
  2. Assign each pair to the appropriate day of week (pair 1 for Monday and Thursday, pair 2 for Tuesday and Friday, pair 3 for Wednesday and Saturday)
  3. For each new week, split the 6 people new way so that no pairs persist.

Step 3 is most interesting here, since it requires building permutations.

To build the pairs, we need to arrange 6 people and take the pairs from the arrangements.

There are 6! = 720 distinct ways to arrange 6 people.

But for our task the following pairs:

  • (1, 2) and (2, 1)
  • (3, 4) and (4, 3)
  • (5, 6) and (6, 5)

are considered equal. That's why we can use only 720 / 8 = 90 arrangements which yield non-repeating pairs.

SQL Server does not provide a native way to build all permutations of a set. We will have to emulate it using recursive CTE's.

Let's create a table with required data:
Read the rest of this entry »

Written by Quassnoi

June 25th, 2009 at 11:00 pm

Posted in SQL Server

IN with a comma separated list: SQL Server

with 3 comments

From Stack Overflow:

How do I search a table for all values from a comma separated list, or return all values if the list contains a single 0?

Using comma separated lists in SQL Server as a search parameter is generally considered a bad design and should be avoided.

One should pass lists in a table variable rather than in a set of comma separated values.

However, there are legacy applications, ORM's and these kinds of stuff that just leave you no choice.

MySQL provides a nice function find_in_set designed exactly for this task.

SQL Server does not directly support an IN predicate for CSV, neither it provides find_in_set. It needs a rowset for IN predicate to work.

To expand a comma separated list into a rowset, we will use a recursive CTE.

In the CTE, we should search for the position of every next comma and return it.

Then we can just take the substring between two commas and cast it into an INT:

DECLARE @lid VARCHAR(100)
SET @lid = '3, 14, 15, 296, 5358';
WITH    cd AS
        (
        SELECT  1 AS first, CHARINDEX(',', @lid, 1) AS next
        UNION ALL
        SELECT  next + 1, CHARINDEX(',', @lid, next + 1)
        FROM    cd
        WHERE   next > 0
        ),
        lid AS
        (
        SELECT  CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first)AS INT) AS id
        FROM    cd
        )
SELECT  *
FROM    lid
id
3
14
15
296
5358

Now we just need to return the values from the table found in this list, or all values if there is a single 0 in the list.

Let's create a sample table and build a query to do this:
Read the rest of this entry »

Written by Quassnoi

June 23rd, 2009 at 11:00 pm

Posted in SQL Server