EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘SQL Server’ Category

Ranging random values

Comments enabled. I *really* need your comment

From Stack Overflow:

Given a set of random numeric values in an SQL Server database, how do I generate a limited list of ranges where each range contains at least one value?

The ranges should not overlap and ideally have a similar amount of values in them.

This is a good job for SQL Server's analytic function NTILE.

This function, given a parameter N, assigns a rank from 1 to N to each value, just as in quartile (4), quintile (5), percentile (100), etc.

The lowest 1/N of the values get a rank of 1, the values from 1/N to 2/N get a rank of 2, etc.

The values, though, are integer and random, so we can get gaps between adjacent ranges.

To avoid this, we need to extend the ranges a little, so that each range shares a common bound with both its upper and lower neighbors. This is best done by using the middle of the gap as the common range boundary.

Say, NTILE assigned rank 1 to values from 100 to 200, rank 2 to values from 210 to 300, and rank 3 to values from 320 to 400.

Then we will use 205 as a common bound for the first two ranges, and 310 as a common bound for the next two.

To do this, for each value selected, we will also fetch the previous one and the next one.

It's obvious that the value that comes before the first value in any group will be the last value in the previous group, and vice versa: the value next to the last value in a group will be the first value in the next group.

Knowing the previous and the next value for each group, we can use them to calculate the median and use it as the range boundary.
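On a modern SQL Server (2012 or later, for LAG and LEAD), the whole idea can be sketched in one query. This is only a sketch: the table t_values and its value column are hypothetical, the number of ranges (3) is arbitrary, and the article itself builds the result with the tools available in 2005:

```sql
WITH    ranked AS
        (
        SELECT  value,
                NTILE(3) OVER (ORDER BY value) AS tile
        FROM    t_values
        ),
        bounds AS
        (
        SELECT  tile,
                MIN(value) AS range_min,
                MAX(value) AS range_max
        FROM    ranked
        GROUP BY
                tile
        )
SELECT  tile,
        -- Extend the range down to the middle of the gap below it
        -- (for the first range, LAG falls back to the range's own minimum)
        (range_min + LAG(range_max, 1, range_min) OVER (ORDER BY tile)) / 2 AS lower_bound,
        -- Extend the range up to the middle of the gap above it
        (range_max + LEAD(range_min, 1, range_max) OVER (ORDER BY tile)) / 2 AS upper_bound
FROM    bounds
```

With the ranges from the example above (100 to 200, 210 to 300, 320 to 400), the shared bounds come out as 205 and 310.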

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

May 17th, 2009 at 11:00 pm

Posted in SQL Server

Indexing a link table

Comments enabled. I *really* need your comment

From Stack Overflow:

A relation table is the common solution to representing a many-to-many (m:n) relationship.

In the simplest form, it combines foreign keys referencing the two relating tables to a new composite primary key:

CREATE TABLE t_a (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_b (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_ab (
a INT NOT NULL,
b INT NOT NULL
)

How should it be indexed to provide optimal performance in every JOIN situation?

I'll consider SQL Server as the target platform this time.

We need all WHERE, ORDER BY and GROUP BY conditions to be covered by the indexes.

At first thought, we need to have indexes on all columns in all orders:
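Concretely, the two column orders for t_ab might look like this (a sketch only; the constraint and index names are mine, and whether both indexes actually pay off is exactly what gets measured below):

```sql
-- A clustered PRIMARY KEY covers joins driven from t_a: a first, then b
ALTER TABLE t_ab ADD CONSTRAINT PK_ab PRIMARY KEY (a, b)

-- A second UNIQUE index covers joins driven from t_b: b first, then a
CREATE UNIQUE INDEX UX_ab_b_a ON t_ab (b, a)
```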
Read the rest of this entry »

Written by Quassnoi

May 12th, 2009 at 11:00 pm

Posted in SQL Server

Indexing VARCHAR(MAX)

with 6 comments

SQL Server 2005 introduced a new datatype: VARCHAR(MAX).

It is a replacement for the old TEXT type that can do anything TEXT could, and some things TEXT could not.

Those things include the possibility of being queried and compared like a plain VARCHAR. A really nice feature.

SELECT  *
FROM    [20090501_max].t_bigdata
WHERE   value LIKE 'lorem ipsum dolor sit%'

Unfortunately, VARCHAR(MAX) is still far from a plain VARCHAR. For instance, it cannot serve as an index key. It can be part of an index (the column is included in the index leaves to avoid table lookups when using the index), but not the key of an index (the index leaves are not sorted on this column).

This means that queries using a LIKE predicate on this column will be slow.
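One common workaround, not taken from this article but sketched here under stated assumptions, is to add a persisted computed column holding an indexable prefix of the value; the prefix length of 450 is an arbitrary choice that keeps the key under the 900-byte index key limit:

```sql
-- A VARCHAR(MAX) column cannot be an index key, but a short prefix of it can
ALTER TABLE [20090501_max].t_bigdata
        ADD value_prefix AS CAST(value AS VARCHAR(450)) PERSISTED

CREATE INDEX IX_bigdata_prefix ON [20090501_max].t_bigdata (value_prefix)

-- The extra predicate on value_prefix lets the optimizer seek on the index,
-- while the original predicate on value keeps the results correct
SELECT  *
FROM    [20090501_max].t_bigdata
WHERE   value_prefix LIKE 'lorem ipsum dolor sit%'
        AND value LIKE 'lorem ipsum dolor sit%'
```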

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

May 1st, 2009 at 11:00 pm

Posted in SQL Server

Rating movies

Comments enabled. I *really* need your comment

From Stack Overflow:

I have the following query in SQL Server:

SELECT  TOP 50 CustomerID FROM Ratings
WHERE   CustomerID != 915
        AND MovieID IN
        (
        SELECT DISTINCT MovieID
        FROM Ratings
        WHERE CustomerID = 915
        )
GROUP BY
        CustomerID
ORDER BY
        COUNT(*) DESC

It is super fast, but when I try to use it in a subquery like this:

SELECT  *
FROM    Ratings
WHERE   MovieID = 1
        AND CustomerID IN
        (
        SELECT  TOP 50 CustomerID FROM Ratings
        WHERE   CustomerID != 915
                AND MovieID IN
                (
                SELECT DISTINCT MovieID
                FROM Ratings
                WHERE CustomerID = 915
                )
        )
GROUP BY
        CustomerID
ORDER BY
        COUNT(*) DESC

, it's getting slow.

Any ideas on why this is so slow and how I can speed it up?

My PRIMARY KEY is (MovieID, CustomerID) and I added an index on CustomerID

This query is in fact a simple but efficient movie rating assistant.

When a user searches for a movie, it shows them the ratings other users gave to this movie, but only those from people who mostly watch and rate the same movies as the user in question.

This is quite a fair rating system: if a user is into Attack of the Killer Tomatoes! and Surf Nazis Must Die, their taste correlates better with the taste of people who watch and rate the same stuff than with that of those who watch Titanic.

Now, returning to the query. To build an advice on movie 1 for user 915, we need to do the following:

  1. Build a list of movies that the user 915 rated
  2. Build the list of users who have watched movie 1 and have rated the same movies as user 915 had
  3. Select the TOP 50 users who have rated the most of the movies that user 915 has also rated
  4. Show their opinions on movie 1

If we look into this plan, we will see two search conditions: on step 1, we need to select movies for a given user, while on step 2 we need to select users for a given movie.

Since we don't need any other fields on steps 1 and 2, this query will benefit from creating two composite indexes of the same set of columns: MovieID and CustomerID, in different order.

Since we already have a PRIMARY KEY on (MovieID, CustomerID), we'll need to add just one more index on (CustomerID, MovieID).

And since these pairs are unique, we will also make this index UNIQUE. (My article Making an index UNIQUE describes why you should always declare UNIQUE index on intrinsically unique data.)

Having two UNIQUE indexes on the same set of columns may seem a redundancy. And in fact, it is.

But let's remember that any index is a redundancy. An index does not keep any data not initially contained in the table, it just arranges it so that it can be accessed faster.

So if two indexes make our query even faster, then hail redundancy!
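In DDL terms, the plan above boils down to a single statement against the Ratings table from the question (the index name is mine):

```sql
-- Covers step 1: all movies rated by a given customer.
-- Step 2 is already covered by the PRIMARY KEY on (MovieID, CustomerID).
CREATE UNIQUE INDEX UX_ratings_customer_movie ON Ratings (CustomerID, MovieID)
```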

Let's create the sample table:
Read the rest of this entry »

Written by Quassnoi

April 30th, 2009 at 11:00 pm

Posted in SQL Server

Descending indexes

with 2 comments

From Stack Overflow:

When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending.

I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way?

What difference does it make which order I choose?

This matters when composite indexes are used.
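A quick sketch of the case in question (the table and index names are hypothetical): a composite index can satisfy a mixed-direction ORDER BY without a sort step only if its keys were declared in matching directions:

```sql
CREATE TABLE t_order (id INT NOT NULL, grouper INT NOT NULL, ts DATETIME NOT NULL)

-- Keys declared in mixed directions
CREATE INDEX IX_order_mixed ON t_order (grouper ASC, ts DESC)

-- This ORDER BY matches the index order exactly, so the rows can be
-- returned straight from an index scan, with no sort step
SELECT  *
FROM    t_order
ORDER BY
        grouper ASC, ts DESC
```

An all-ascending index on (grouper, ts) could serve ORDER BY grouper DESC, ts DESC with a backward scan, but not this mixed ordering.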

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

April 27th, 2009 at 11:00 pm

Posted in SQL Server

Making an index UNIQUE

with one comment

From Stack Overflow:

In SQL Server, if I know an index will have unique values, how will it affect performance on INSERT's or SELECT's if I declare it as such?

If the optimiser knows the index is UNIQUE, how will that affect the query plan?

I understand that specifying uniqueness can serve to preserve integrity, but leaving that discussion aside for the moment, what are the perfomance consequences?

Let's see how a UNIQUE index performs compared to a non-UNIQUE one.

The indexes mentioned hereafter are considered to be built on the same intrinsically unique dataset. I will leave consistency issues aside, focusing only on performance details.

A UNIQUE index, contrary to popular opinion, does not have any performance drawbacks on DML compared to a plain, non-UNIQUE index.
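Declaring the uniqueness is a one-word change, sketched here on a hypothetical table:

```sql
-- A table whose (a, b) pairs are intrinsically unique
CREATE TABLE t_pairs (a INT NOT NULL, b INT NOT NULL)

-- Maintained exactly like a plain index on INSERT, but it additionally
-- tells the optimizer that a full key lookup matches at most one row
CREATE UNIQUE INDEX UX_pairs_a_b ON t_pairs (a, b)
```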

When the engine tries to insert a new row into the indexed table, whether the index is UNIQUE or not, it performs the following steps:
Read the rest of this entry »

Written by Quassnoi

April 23rd, 2009 at 11:00 pm

Posted in SQL Server

Counting missing rows: SQL Server

Comments enabled. I *really* need your comment

This is the 3rd of 5 articles covering the implementation of the NOT IN predicate in several RDBMSes:

Now, let's see how the same predicate is implemented in SQL Server.

We'll create sample tables just like those we used yesterday:
Read the rest of this entry »

Written by Quassnoi

April 20th, 2009 at 11:00 pm

Posted in SQL Server

Missing entries

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a case where I want to choose any database entry that has an invalid Country, Region, or Area ID.

By invalid, I mean an ID for a country or region or area that no longer exists in my tables.

I have four tables: Properties, Countries, Regions, Areas.

I was thinking to do it like this:

SELECT  *
FROM    Properties
WHERE   CountryID NOT IN 
        (
        SELECT  CountryID
        FROM    Countries
        )
        OR
        RegionID NOT IN
        (
        SELECT  RegionID
        FROM    Regions
        )
        OR
        AreaID NOT IN
        (
        SELECT  AreaID
        FROM    Areas
        )

Now, is my query right?

We have two options here:

  1. Use a LEFT OUTER JOIN with the three dimension tables (Countries, Regions and Areas), filtering out non-NULL values
  2. Use a NOT IN / NOT EXISTS approach shown above

LEFT JOIN intuitively seems to be more elegant and more efficient than subqueries. But is it really?
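For reference, the LEFT JOIN variant of the same query would look like this (a sketch; the actual execution plans are compared below):

```sql
SELECT  p.*
FROM    Properties p
LEFT JOIN
        Countries c
ON      c.CountryID = p.CountryID
LEFT JOIN
        Regions r
ON      r.RegionID = p.RegionID
LEFT JOIN
        Areas a
ON      a.AreaID = p.AreaID
-- A NULL from any of the dimension tables means the key was not found there
WHERE   c.CountryID IS NULL
        OR r.RegionID IS NULL
        OR a.AreaID IS NULL
```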
Read the rest of this entry »

Written by Quassnoi

April 14th, 2009 at 11:00 pm

Posted in SQL Server

Matching whole sets

Comments enabled. I *really* need your comment

From Stack Overflow:

Given the following:

declare @a table
(
    pkid int,
    value int
)

declare @b table
(
    otherID int,
    value int
)

insert into @a values (1, 1000)
insert into @a values (1, 1001)
insert into @a values (2, 1000)
insert into @a values (2, 1001)
insert into @a values (2, 1002)

insert into @b values (-1, 1000)
insert into @b values (-1, 1001)
insert into @b values (-1, 1002)

, how do I query for all the values in @a that completely match up with @b?

{@a.pkid = 1, @b.otherID = -1} would not be returned (only 2 of 3 values match)

{@a.pkid = 2, @b.otherID = -1} would be returned (3 of 3 values match)

We need to compare each set of values from @a to each set from @b and return the IDs of the matching sets.

Two sets are considered matching when for each row from the first set there is an equal row in the second set and vice versa. That is, both the values and the count of the rows should match in both sets.

The first thing that comes to mind is to calculate the COUNT(*) of both sets, then calculate the COUNT(*) of the result of the two sets joined on value, and make sure all three are equal.

But we don't actually need to COUNT(*) all values. We just need to find out the first row from either of the subsets that has no match in another subset. As soon as we find such a row, we can tell that the sets don't match; if we don't find one, then the sets do match.
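One way to express "no unmatched row in either direction" in a single query is a pair of NOT EXISTS predicates over EXCEPT (a sketch against the table variables from the question; note that EXCEPT compares sets, so exact duplicate rows within one set would additionally need a COUNT(*) check):

```sql
SELECT  DISTINCT a.pkid, b.otherID
FROM    @a a
CROSS JOIN
        @b b
-- Nothing in a's set is missing from b's set...
WHERE   NOT EXISTS
        (
        SELECT  value FROM @a aa WHERE aa.pkid = a.pkid
        EXCEPT
        SELECT  value FROM @b bb WHERE bb.otherID = b.otherID
        )
        -- ...and nothing in b's set is missing from a's set
        AND NOT EXISTS
        (
        SELECT  value FROM @b bb WHERE bb.otherID = b.otherID
        EXCEPT
        SELECT  value FROM @a aa WHERE aa.pkid = a.pkid
        )
```

With the sample data above, only the pair {pkid = 2, otherID = -1} satisfies both predicates.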
Read the rest of this entry »

Written by Quassnoi

March 31st, 2009 at 11:00 pm

Posted in SQL Server

Deleting duplicates

Comments enabled. I *really* need your comment

Microsoft Knowledge Base has an article KB139444 on how to delete duplicate rows from a table that has no primary key.

Though it works, it is quite overkill to delete duplicate rows with grouping, temporary tables, deleting and reinserting.

With SQL Server 2005 and above, there is a much more elegant solution.

Let's create a table and fill it with duplicate records:

CREATE TABLE t_duplicate (id INT NOT NULL, value VARCHAR(50) NOT NULL)
GO

SET NOCOUNT ON

BEGIN TRANSACTION

DECLARE @i INT
SET @i = 1
WHILE @i < 5000
BEGIN
        INSERT
        INTO    t_duplicate
        VALUES  (@i / 1000 + 1, 'Value ' + CAST(@i AS VARCHAR))
        SET @i = @i + 1
END

COMMIT

SELECT  *
FROM    t_duplicate

id value
1  Value 1
1  Value 2
1  Value 3
…
5  Value 4997
5  Value 4998
5  Value 4999

And now let's delete the duplicates:


WITH    q AS
        (
        SELECT  d.*,
                ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
        FROM    t_duplicate d
        ) 
DELETE
FROM    q
WHERE   rn > 1

SELECT  *
FROM    t_duplicate

id value
1  Value 1
2  Value 1000
3  Value 2000
4  Value 3000
5  Value 4000

Done in a single query.

Written by Quassnoi

March 14th, 2009 at 11:00 pm

Posted in SQL Server