SQL Server Archives - Page 6 of 8 - EXPLAIN EXTENDED

Archive for the ‘SQL Server’ Category

Indexing tiny tables

Hypothetically, in a SQL Server database, if I have a table with two INT fields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the two INT fields overcomes the overhead imposed by said indexes?

Very nice question.

Imagine we have a table table1 (col INT NOT NULL, value1 INT NOT NULL, value2 VARCHAR(100) NOT NULL, …) with an index on col.

In SQL Server, an index can improve the following things.

Index lookups (range scans). An index can be used to limit the number of rows scanned:
```
SELECT  *
FROM    table1 t1
WHERE   t1.col BETWEEN @start AND @end
```
Secondary data source. This query will get all the data it needs from the index:
```
SELECT  t1.col
FROM    table1 t1
```

Ordered rowsets. This query will not need to sort:

```
SELECT  *
FROM    table1 t1
ORDER BY
        t1.col
```

Efficient MIN / MAX:
```
SELECT  MIN(col)
FROM    table1 t1
```

The last case is probably the best illustration of the fact that even a table of 2 rows can benefit from an index.
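
As a minimal sketch of that last case: the table definition below keeps only the three columns named above (the rest are elided in the article), and the index name `ix_table1_col` is made up for illustration. With such an index in place, the optimizer can typically answer `MIN(col)` by reading a single entry from one end of the index rather than scanning the table, however small it is.

```sql
-- Assumed definition: only the columns named in the text
CREATE TABLE table1
        (
        col INT NOT NULL,
        value1 INT NOT NULL,
        value2 VARCHAR(100) NOT NULL
        );

-- Hypothetical index name
CREATE INDEX ix_table1_col ON table1 (col);

-- Can be served from the index alone
SELECT  MIN(col)
FROM    table1 t1;
```

Comparing the actual execution plans with and without the index is the simplest way to check whether this holds for a given table.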

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

June 22nd, 2009 at 11:00 pm

Posted in SQL Server

Efficient EXISTS

In my previous article:

IN vs. JOIN vs. EXISTS

I compared the efficiency of different methods of checking for the existence of a value in a subquery resultset.

These methods included IN and EXISTS predicates.

It was shown that in SQL Server, a poorly written EXISTS can be very slow, while an IN predicate is almost always efficiently optimized using one of the numerous SEMI JOIN methods.

However, a properly written EXISTS predicate can outperform IN.

Let's create a sample table and see how:
Read the rest of this entry »

Written by Quassnoi

June 17th, 2009 at 11:00 pm

Posted in SQL Server

IN vs. JOIN vs. EXISTS

with 16 comments

From Stack Overflow:

I'm wanting to select rows in a table where the primary key is in another table.

I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005:
```
SELECT  *
FROM    a
WHERE   a.c IN
        (
        SELECT  d
        FROM    b
        )
```
```
SELECT  a.*
FROM    a
JOIN    b
ON      a.c = b.d
```
Is there any significant performance difference between these two SQL queries with a large dataset (i. e. millions of rows)?

Before we begin, I should note that the queries provided by the author are not in fact equivalent and can produce different resultsets.

If the table used in the subquery returns a value twice, a JOIN will also return the matching rows twice, while an IN condition will return them only once.

That's why it's more correct to compare the IN query to the following one:

```
SELECT  a.*
FROM    a
JOIN    (
        SELECT  DISTINCT d
        FROM    b
        ) bo
ON      a.c = bo.d
```

Now, à nos moutons.

Folk wisdom advises against using IN constructs in favor of JOINs, on the grounds that the subquery will supposedly be executed many times in a loop.

Let's create some sample tables and see if it's really so:
Read the rest of this entry »

Written by Quassnoi

June 16th, 2009 at 11:00 pm

Posted in SQL Server

Group based identity

From Stack Overflow:

I'm working with an application that adds a new row to the database, based on the last row meeting a certain criteria.

Here is an over-simplified visualization:

A1

A2

A3

B1

B2

Using the visualization above, a web page loads up the highest B value, which is 2.

Then, after some time, it wants to insert B3, the next record in the series.

However, it has to check to make sure that someone else didn't do the same thing.

In actuality, each row corresponds to a place that a piece of equipment visited in a factory.

The page (not easy to rewrite), currently moves the equipment to the next step.

Is there a standard pattern for dealing with this type of problem, or do I simply need to lock the table?

This question combines two other frequently asked questions:

How do I make group-based identity (i. e. keep several identity series identified by the value of a certain column)
How do I ensure there will be no identity gaps?

While there are numerous solutions that involve selecting MAX(column) and updating the row inside a transaction, there is a simpler way that uses linked lists.
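
To make the idea concrete, here is one way a linked list could be expressed with constraints. This is only a sketch under assumptions and not the article's own table, which is behind the link below: the table name `equipment_step` and all column and constraint names are hypothetical. Each row points to its predecessor in the same group, and the constraints make sure a predecessor exists and that the next number can be taken only once.

```sql
-- Hypothetical table: every row links to the previous step of its group
CREATE TABLE equipment_step
        (
        grp CHAR(1) NOT NULL,
        seq INT NOT NULL,
        prev_seq INT NULL,
        CONSTRAINT pk_equipment_step PRIMARY KEY (grp, seq),
        -- each row can be the predecessor of at most one other row
        CONSTRAINT ux_equipment_step_prev UNIQUE (grp, prev_seq),
        -- the predecessor must exist (not checked when prev_seq is NULL)
        CONSTRAINT fk_equipment_step_prev
                FOREIGN KEY (grp, prev_seq)
                REFERENCES equipment_step (grp, seq),
        -- the first row of a group is 1; every other row is predecessor + 1
        CONSTRAINT ck_equipment_step_seq
                CHECK (
                (prev_seq IS NULL AND seq = 1)
                OR (prev_seq IS NOT NULL AND seq = prev_seq + 1)
                )
        );

INSERT INTO equipment_step (grp, seq, prev_seq) VALUES ('B', 1, NULL);
INSERT INTO equipment_step (grp, seq, prev_seq) VALUES ('B', 2, 1);

-- Two sessions that both loaded B2 would both attempt this statement;
-- the second attempt fails with a key violation instead of duplicating B3
INSERT INTO equipment_step (grp, seq, prev_seq) VALUES ('B', 3, 2);
```

With this design no explicit table lock is needed: the page just tries the insert and treats a constraint violation as "someone else already moved the equipment".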

We need to create the table as follows:
Read the rest of this entry »

Written by Quassnoi

June 15th, 2009 at 11:00 pm

Posted in SQL Server

Flattening timespans: SQL Server

From Stack Overflow:

I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or continuous and need to merge into one timespan:

date	start	stop
2009.06.03	05:54:48:000	10:00:13:000
2009.06.03	09:26:45:000	09:59:40:000

The resulting timespan would be from 05:54:48 to 10:00:13.

Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:

date	start	stop
2009.06.03	10:12:50:000	10:27:25:000
2009.06.03	10:13:12:000	11:14:56:000
2009.06.03	10:27:25:000	10:27:31:000
2009.06.03	10:27:39:000	13:53:38:000
2009.06.03	11:14:56:000	11:15:03:000
2009.06.03	11:15:30:000	14:02:14:000
2009.06.03	13:53:38:000	13:53:43:000
2009.06.03	14:02:14:000	14:02:31:000

which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.

Any solution, be it SQL or not, is appreciated.

This is also quite a common task.

The algorithm here is simple:

Order all timespans by start
Initialize a variable which will hold a running maximum of the timespans' stop
Take the timespans one by one, and for each timespan, check its start against the running maximum of the previous stops. If the current start is greater than that maximum, all timespans that started earlier have already ended, and we have a gap
Update the running maximum

Unfortunately, SQL Server (as of the 2008 release) does not provide a simple way to calculate running maximums. We could use a correlated subquery of course, but it would take forever.

This is one of the rare cases where a cursor in SQL Server works faster than set-based operations.
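
The steps above can be sketched with a cursor as follows. This is a minimal illustration and not the article's full solution: the table variables `@spans` and `@merged` and all variable names are hypothetical, only four of the sample rows are loaded, and the date and time are folded into single DATETIME values.

```sql
-- Hypothetical input: a subset of the sample timespans
DECLARE @spans TABLE (start_time DATETIME NOT NULL, stop_time DATETIME NOT NULL);
DECLARE @merged TABLE (start_time DATETIME NOT NULL, stop_time DATETIME NOT NULL);

INSERT INTO @spans (start_time, stop_time)
SELECT '2009-06-03T05:54:48', '2009-06-03T10:00:13' UNION ALL
SELECT '2009-06-03T09:26:45', '2009-06-03T09:59:40' UNION ALL
SELECT '2009-06-03T10:12:50', '2009-06-03T10:27:25' UNION ALL
SELECT '2009-06-03T10:13:12', '2009-06-03T11:14:56';

DECLARE @start DATETIME, @stop DATETIME;
DECLARE @cur_start DATETIME, @cur_stop DATETIME; -- merged span being built

-- Step 1: order all timespans by start
DECLARE span_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT  start_time, stop_time
        FROM    @spans
        ORDER BY
                start_time;

OPEN span_cursor;
FETCH NEXT FROM span_cursor INTO @start, @stop;

WHILE @@FETCH_STATUS = 0
BEGIN
        IF @cur_start IS NULL
        BEGIN
                -- first timespan: initialize the running maximum
                SELECT  @cur_start = @start, @cur_stop = @stop;
        END
        ELSE IF @start > @cur_stop
        BEGIN
                -- gap found: emit the merged span and start a new one
                INSERT INTO @merged (start_time, stop_time)
                VALUES (@cur_start, @cur_stop);
                SELECT  @cur_start = @start, @cur_stop = @stop;
        END
        ELSE IF @stop > @cur_stop
        BEGIN
                -- overlapping or adjacent: update the running maximum
                SET @cur_stop = @stop;
        END
        FETCH NEXT FROM span_cursor INTO @start, @stop;
END

CLOSE span_cursor;
DEALLOCATE span_cursor;

-- emit the last merged span, if there was any input
IF @cur_start IS NOT NULL
BEGIN
        INSERT INTO @merged (start_time, stop_time)
        VALUES (@cur_start, @cur_stop);
END

SELECT  start_time, stop_time
FROM    @merged
ORDER BY
        start_time;
```

For these four rows the result is two merged timespans: 05:54:48 to 10:00:13 and 10:12:50 to 11:14:56. The cursor makes a single ordered pass over the data, which is why it can beat a correlated subquery that recomputes the maximum for every row.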

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

June 11th, 2009 at 11:00 pm

Posted in SQL Server

Joining unjoinable

From Stack Overflow:

I want to combine two tables into one.

Let's say I have:

Table1:

ID	Name
1	A
2	B
3	C

Table2:

ID	Name
4	D
5	E
6	F

I want to make Table3:

Name1	Name2
A	D
B	E
C	F

How can I do this in SQL Server?

This is certainly a task for a JOIN, since we have two tables here and want to match each row from the first table with a row from the second table.

But what condition do we use for a JOIN?

The IDs of the rows seem to be arbitrary and have no special meaning. They could just as well be sparse and randomly distributed, so it's impossible to build a formula which maps one ID to another. The same is true for the names.

However, there actually is a condition which we could JOIN on.

Since the author mentioned there is the same number of rows in both tables, we could use ROW_NUMBER() as a JOIN condition.

We just need to wrap both tables into CTEs which select a ROW_NUMBER() along with the table data:

```
WITH    table1 AS
        (
        SELECT  1 AS id, 'A' AS name
        UNION ALL
        SELECT  2 AS id, 'B' AS name
        UNION ALL
        SELECT  3 AS id, 'C' AS name
        ),
        table2 AS
        (
        SELECT  4 AS id, 'D' AS name
        UNION ALL
        SELECT  5 AS id, 'E' AS name
        UNION ALL
        SELECT  6 AS id, 'F' AS name
        ),
        q1 AS
        (
        SELECT  table1.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
        FROM    table1
        ),
        q2 AS
        (
        SELECT  table2.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
        FROM    table2
        )
SELECT  q1.name, q2.name
FROM    q1
JOIN    q2
ON      q2.rn = q1.rn
```

name	name
A	D
B	E
C	F
3 rows fetched in 0.0002s (0.0264s)

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms.

Written by Quassnoi

June 10th, 2009 at 11:00 pm

Posted in SQL Server

Longest common prefix: SQL Server

From Stack Overflow:

I have some data:

id	ref
1	3536757616
1	3536757617
1	3536757618
2	3536757628
2	3536757629
2	3536757630

and want to get the result like this:

id	result
1	3536757616/7/8
2	3536757629/28/30

Essentially, the data should be aggregated on id, and the ref's should be concatenated together and separated by a / (slash), but with longest common prefix removed.

I've already encountered this problem several times, so I'll try to cover solutions for all the RDBMSs my blog is about.

I hope this will be interesting as approaches will differ significantly for all four systems.

Today is SQL Server time.

I won't create sample tables here, since I'm demonstrating the principle. Instead, I'll just use dynamically generated data.

What we need to do here is:

1. Find the longest common prefix for each group and its length
2. Cut off the prefix of each ref but the first, using SUBSTRING
3. Concatenate the strings using FOR XML

Steps 2 and 3 are quite simple, but the first one needs some effort.

SQL Server doesn't provide a function to find the longest common prefix (LCP), so we'll have to implement it.

With some limitations, it may be done using pure SQL, no UDF's.
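
One possible pure SQL approach to the first step is sketched below; it is not necessarily the one the full article uses. It relies on the observation that the longest common prefix of a group equals the longest common prefix of its smallest and largest strings, and then finds the first position where those two differ. The CTE names `refs`, `bounds` and `numbers` are hypothetical, and the sketch assumes that the refs are digit strings of at most 10 characters.

```sql
WITH    refs (id, ref) AS
        (
        -- dynamically generated data from the question
        SELECT  1, '3536757616' UNION ALL
        SELECT  1, '3536757617' UNION ALL
        SELECT  1, '3536757618' UNION ALL
        SELECT  2, '3536757628' UNION ALL
        SELECT  2, '3536757629' UNION ALL
        SELECT  2, '3536757630'
        ),
        bounds AS
        (
        -- smallest and largest string of each group
        SELECT  id, MIN(ref) AS lo, MAX(ref) AS hi
        FROM    refs
        GROUP BY
                id
        ),
        numbers (n) AS
        (
        -- character positions 1 to 11 (assumed maximum length plus one)
        SELECT  1
        UNION ALL
        SELECT  n + 1
        FROM    numbers
        WHERE   n < 11
        )
SELECT  b.id,
        MIN(n.n) - 1 AS lcp_length,
        LEFT(b.lo, MIN(n.n) - 1) AS lcp
FROM    bounds b
JOIN    numbers n
ON      SUBSTRING(b.lo, n.n, 1) <> SUBSTRING(b.hi, n.n, 1)
        OR n.n > LEN(b.lo)
GROUP BY
        b.id, b.lo
```

For the sample data this gives a prefix of length 9 (`353675761`) for id 1 and of length 8 (`35367576`) for id 2. The `n.n > LEN(b.lo)` condition covers groups where all strings are identical, in which case the whole string is the prefix.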
Read the rest of this entry »

Written by Quassnoi

June 4th, 2009 at 11:00 pm

Posted in SQL Server

Selecting records unique on each column

From Stack Overflow:

I need to select 3 columns from a table, but I need each value from any column to be unique in the resultset for this column.

This query:
```
SELECT DISTINCT TOP 10 a, b, c
FROM   x
```
will return 10 distinct sets.

How do I do it?

Here we need to make sure that, for each column of a record we are about to return, no record returned so far contains the same value in that column.

Let's create a sample table and construct the query:
Read the rest of this entry »

Written by Quassnoi

June 3rd, 2009 at 11:00 pm

Posted in SQL Server

Generating XML in subqueries

with one comment

From Stack Overflow:

I have a query that looks like
```
SELECT  P.Column1, P.Column2, P.Column3,
        (
        SELECT  A.ColumnX, A.ColumnY
        FROM    dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
        WHERE   A.Key = P.Key
        FOR XML AUTO, TYPE
        ),
        (
        SELECT  B.ColumnX, B.ColumnY
        FROM    dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
        WHERE   B.Key = P.Key
        FOR XML AUTO, TYPE
        )
FROM    (
        /* joined tables here */
        ) AS P
FOR XML AUTO, ROOT('ROOT')
```
P has about 5,000 rows.

A and B, about 4,000 rows each.

This query has a runtime performance of more than 10 minutes.

Changing it to this, however:
```
SELECT  P.Column1,
        P.Column2,
        P.Column3
INTO    #P
FROM    (
        /* joined tables here */
        ) p

SELECT  A.ColumnX,
        A.ColumnY
INTO    #A
FROM    dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A

SELECT  B.ColumnX, B.ColumnY
INTO    #B
FROM    dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B

SELECT  P.Column1, P.Column2, P.Column3,
        (
        SELECT  A.ColumnX, A.ColumnY
        FROM    #A AS A
        WHERE   A.Key = P.Key
        FOR XML AUTO, TYPE
        ),
        (
        SELECT  B.ColumnX, B.ColumnY
        FROM    #B AS B
        WHERE   B.Key = P.Key
        FOR XML AUTO, TYPE
        )
FROM    #P AS P
FOR XML AUTO, ROOT('ROOT')
```
, has a performance of about 4 seconds.

This makes not a lot of sense, as it would seem the cost to insert into a temp table and then do the join should be higher by default.

My inclination is that SQL is doing the wrong type of "join" with the subquery, but maybe I've missed it, there's no way to specify the join type to use with correlated subqueries.

Is there a way to achieve this without using #temptable's or @table_variable's via indexes and/or hints?

The reason for this is obvious: SQL Server needs to reevaluate the function each time the subquery is called, which takes a long time.

When the subquery result is cached in the temp table, SQL Server will just build an index over this table in an Eager Spool, and use this index in the joins.

But is it possible to increase performance without using temporary tables?

Let's create sample functions, tables and see:
Read the rest of this entry »

Written by Quassnoi

May 28th, 2009 at 11:00 pm

Posted in SQL Server

Ordering columns independently

From Stack Overflow:

I have a table with 5 columns in it.

What's the easiest way to select all rows, but where each column is individually randomized?

All I can think of is to select each column separately, along with

ROW_NUMBER() OVER (ORDER BY NEWID()) AS lookup

, and then join each column back together on lookup.

Is there an easier way?

Though it's possible to do this by means of JOIN'ing, there is an easier way.

Since all SQL operations are row-level, we will need to make 5 independently ordered sets, which we can later use to order each column's values. This implies having 5 times as many rows as there are in the original table.

But instead of JOIN'ing, we can use PIVOT, the SQL Server clause which maps rows to columns.

We need to do the following:

Generate 5 sets of column values and UNION ALL them.
For each set, select a ROW_NUMBER() OVER (ORDER BY NEWID()) along with each row. It will be unique for each row in any set and will give a row identifier for PIVOT'ing.
For each set, select a set identifier: just a string with column name. This will give a column identifier for PIVOT'ing
Finally, PIVOT the column values: they will get into appropriate row and column, based on ROW_NUMBER() and the set name

Since PIVOT requires aggregation (in case row and column identifiers are not unique), we need some aggregation function, but our query is designed so that there is exactly one row with a given ROW_NUMBER() and set name, so any aggregation function will do.

We will use MIN.
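
The steps above can be sketched as follows. This is only an illustration under assumptions: the inline sample set `t` and its columns `a`, `b`, `c` are hypothetical, three columns are used instead of five to keep it short, and all values are cast to VARCHAR so that they fit into a single column of the UNION ALL.

```sql
WITH    t (a, b, c) AS
        (
        -- hypothetical sample data
        SELECT  1, 'x', 10 UNION ALL
        SELECT  2, 'y', 20 UNION ALL
        SELECT  3, 'z', 30
        ),
        sets AS
        (
        -- one set per column: set name, value, independent random row number
        SELECT  'a' AS col,
                CAST(a AS VARCHAR(20)) AS val,
                ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
        FROM    t
        UNION ALL
        SELECT  'b',
                CAST(b AS VARCHAR(20)),
                ROW_NUMBER() OVER (ORDER BY NEWID())
        FROM    t
        UNION ALL
        SELECT  'c',
                CAST(c AS VARCHAR(20)),
                ROW_NUMBER() OVER (ORDER BY NEWID())
        FROM    t
        )
SELECT  rn, [a], [b], [c]
FROM    sets
PIVOT   (
        -- exactly one value per (rn, col), so any aggregate would do
        MIN(val)
        FOR col IN ([a], [b], [c])
        ) p
ORDER BY
        rn
```

Each run returns three rows in which every column holds the original values of that column in its own random order. The column `rn` is not mentioned inside the PIVOT clause, so it becomes the grouping column, that is, the row identifier.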

Now, let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

May 22nd, 2009 at 11:00 pm

Posted in SQL Server
