EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘SQL Server’ Category

Generating row placeholders: SQL Server


Answering questions asked on the site.

Vladimir asks:

I have two tables: one contains document headers, the other one contains document items.

The document header defines how many items should be in this document (usually 10 or 15, but this can vary).

I need to generate a list of documents and a list of document items to feed them to a third-party control. Normally, this could easily be achieved by doing a SELECT * from both tables, but there is a problem: the control requires that the number of items for each document be equal to the value of the corresponding column in the document header.

That is, if the document header says 15 items, there should be exactly 15 records for this document in the items rowset, even if there are fewer items in the actual table. It's OK if the extra items have a value of NULL.

I tried to use LEFT JOIN and FULL JOIN but cannot figure out how to use them.

This is quite a common problem in the SQL world.

SQL as such has no means to generate sets, only to transform them. In classical SQL, if we have a relation that contains 2 records on input and need a relation that contains 16 records on output, we will have to cross join several copies of the relation (four of them, in this case) just to have enough rows to begin with.
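For example (a sketch added for illustration, not part of the original article), cross joining four copies of a two-row relation yields the 16 rows mentioned above:

WITH    two (n) AS
        (
        SELECT  1
        UNION ALL
        SELECT  2
        )
-- 2 * 2 * 2 * 2 = 16 rows, numbered with ROW_NUMBER just to make them usable
SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM    two a
CROSS JOIN
        two b
CROSS JOIN
        two c
CROSS JOIN
        two d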

There are two solutions to that problem.

The first solution is to have a dummy rowset handy: create a table of 1,000,000 records, or use some kind of system table whose only purpose is to provide enough rows (their actual values can be completely ignored).

The second solution is to generate rows on the fly using system-provided means. This is impossible in MySQL; however, all other major systems can do it: with generate_series in PostgreSQL (the only system that has a native tool designed specifically for this purpose), with a CONNECT BY query in Oracle, or, in your case, with a recursive CTE in SQL Server.

Recursive queries are somewhat less efficient than stored tables, since they require looping and therefore many logical reads, while reading from a dummy table requires only a few scans.

However, the recursive queries also have several advantages:

  • They are always available, even if you are querying a database that lacks a dummy table and you don't have write access to create one
  • They always have enough records
  • They have consistent plans. If the dummy table contains lots of records and the optimizer chooses a full scan, this can kill performance

That's why I'll base my solution on the recursive CTE.

The main idea to generate a list of document items is simple:

  • We should replicate each document header exactly as many times as the header says (using a recursive CTE), assigning an ordinal number to each copy
  • We should also assign an ordinal number to each document item, using ROW_NUMBER()
  • Finally, we should just left join the two resultsets on the document and the ordinal number

The query will look like this:
Read the rest of this entry »
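For illustration only, here is a minimal sketch of this approach. The table and column names (docheader(id, items) and docitem(header_id, value)) are assumptions made for the sketch, not the ones used in the article; the article's actual query is behind the link above.

WITH    placeholders AS
        (
        -- anchor: one copy of each header, with ordinal 1
        SELECT  id, items, 1 AS rn
        FROM    docheader
        UNION ALL
        -- recursion: keep emitting copies until the ordinal reaches the declared item count
        SELECT  id, items, rn + 1
        FROM    placeholders
        WHERE   rn < items
        ),
        numbered_items AS
        (
        SELECT  header_id, value,
                ROW_NUMBER() OVER (PARTITION BY header_id ORDER BY value) AS rn
        FROM    docitem
        )
SELECT  p.id, p.rn, i.value
FROM    placeholders p
LEFT JOIN
        numbered_items i
ON      i.header_id = p.id
        AND i.rn = p.rn
ORDER BY
        p.id, p.rn
OPTION (MAXRECURSION 0)

The LEFT JOIN pads the missing items with NULLs, which is exactly what the control expects.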

Written by Quassnoi

October 21st, 2009 at 11:00 pm

Posted in SQL Server

Constraints and the optimizer in SQL Server: FOREIGN KEY


Continuing on from yesterday's article.

I was wondering: do the constraints declared on tables in SQL Server affect the decisions made by the optimizer?

SQL Server allows defining the following constraints on columns:

  • PRIMARY KEY
  • UNIQUE
  • FOREIGN KEY
  • CHECK
  • DEFAULT

Today we will see how a FOREIGN KEY affects the plans.

A FOREIGN KEY guarantees that every value of the column constrained with it is contained in a PRIMARY KEY or UNIQUE column of another table.

This fact can be used to simplify joins on these columns and queries that use an IN clause.
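As an illustration (a sketch with hypothetical table names, not the sample tables created below), a trusted foreign key can let the optimizer remove a join entirely when no columns of the referenced table are actually needed:

-- Hypothetical schema: child.parent_id is NOT NULL and carries a trusted
-- FOREIGN KEY referencing parent(id).
-- The constraint guarantees that every child.parent_id has a match in parent,
-- and no columns of parent are selected, so the optimizer can answer the query
-- from child alone and eliminate the join.
SELECT  c.id, c.parent_id
FROM    child c
JOIN    parent p
ON      p.id = c.parent_id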

Let's create the sample tables:
Read the rest of this entry »

Written by Quassnoi

October 15th, 2009 at 11:00 pm

Posted in SQL Server

Constraints and the optimizer in SQL Server: PRIMARY KEY and UNIQUE


Answering questions asked on the site.

Mitch asks:

I was wondering: do the constraints declared on tables in SQL Server affect the decisions made by the optimizer?

SQL Server allows defining the following constraints on columns:

  • PRIMARY KEY
  • UNIQUE
  • FOREIGN KEY
  • CHECK
  • DEFAULT

Since these constraints imply certain restrictions on the data, they theoretically make it possible to use these restrictions when optimizing queries, so that some extra checks (already guaranteed by the constraints) can be skipped.

SQL Server does use these possibilities and optimizes its plans taking the constraints into account.
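For instance (a sketch with a hypothetical table, not the sample tables created below), declaring a column UNIQUE can let the optimizer drop the deduplication step of a DISTINCT, since uniqueness is already guaranteed by the constraint:

-- Hypothetical table: the UNIQUE constraint on code makes DISTINCT redundant,
-- so no Sort / Stream Aggregate is needed to deduplicate the output.
CREATE TABLE t_unique
        (
        id INT NOT NULL PRIMARY KEY,
        code INT NOT NULL CONSTRAINT UQ_t_unique_code UNIQUE
        )

SELECT  DISTINCT code
FROM    t_unique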

In this article we will see how SQL Server uses PRIMARY KEY and UNIQUE to optimize the plans.

PRIMARY KEY and UNIQUE

Let's create a pair of sample tables:
Read the rest of this entry »

Written by Quassnoi

October 14th, 2009 at 11:00 pm

Posted in SQL Server

Running root mean square: SQL Server


From Stack Overflow:

Let's say I have the following in a SQL Server table:

Value
1
3
2
6
3

I need to get the differences of each of these numbers (in order), then square them, sum them, divide by the number of values and take the square root.

This value is called the root mean square. It is widely used in statistics.

However, we should calculate it for the differences between the values, not for the values themselves. In Oracle, the differences between rows could easily be calculated using LAG, an analytic function that returns a value from a previous row. Unfortunately, SQL Server lacks this function.

It could be emulated using a subquery or CROSS APPLY along with ORDER BY and TOP, but this makes the query less elegant and less efficient for large tables (since using TOP in subqueries and CROSS APPLY forces NESTED LOOPS).

It is better to assign a row number to each row (using ROW_NUMBER) and then just make an equijoin between consecutive rows. This looks much better, and the optimizer is free to choose any join method, including a Hash Join, which is most efficient when a large percentage of the table's rows participates in the join, or a Merge Join, which will also be efficient, given that ROW_NUMBER() tends to output ordered resultsets and SQL Server can utilize this fact.

Here's the query:

WITH    nums AS
        (
        SELECT  value, ROW_NUMBER() OVER (ORDER BY orderer) AS rn
        FROM    source
        )
SELECT  SQRT(AVG(POWER(np.value - nn.value, 2)))
FROM    nums np
JOIN    nums nn
ON      nn.rn = np.rn + 1

And let's try it on the sample data:

WITH    source (orderer, value) AS
        (
        SELECT  10, 1
        UNION ALL
        SELECT  20, 3
        UNION ALL
        SELECT  30, 2
        UNION ALL
        SELECT  40, 6
        UNION ALL
        SELECT  50, 3
        ),
        nums AS
        (
        SELECT  value, ROW_NUMBER() OVER (ORDER BY orderer) AS rn
        FROM    source
        )
SELECT  SQRT(AVG(POWER(np.value - nn.value, 2)))
FROM    nums np
JOIN    nums nn
ON      nn.rn = np.rn + 1

2.6457513110645907
1 row fetched in 0.0001s (0.0007s)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 
  |--Compute Scalar(DEFINE:([Expr1027]=sqrt(CONVERT_IMPLICIT(float(53),[Expr1026],0))))
       |--Compute Scalar(DEFINE:([Expr1026]=CASE WHEN [Expr1038]=(0) THEN NULL ELSE [Expr1039]/CONVERT_IMPLICIT(int,[Expr1038],0) END))
            |--Stream Aggregate(DEFINE:([Expr1038]=COUNT_BIG(power([Union1011]-[Union1024],(2.000000000000000e+000))), [Expr1039]=SUM(power([Union1011]-[Union1024],(2.000000000000000e+000)))))
                 |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([Expr1025])=([Expr1033]), RESIDUAL:([Expr1025]=([Expr1012]+(1))))
                      |--Sequence Project(DEFINE:([Expr1025]=row_number))
                      |    |--Compute Scalar(DEFINE:([Expr1035]=(1)))
                      |         |--Segment
                      |              |--Sort(ORDER BY:([Union1023] ASC))
                      |                   |--Constant Scan(VALUES:(((10),(1)),((20),(3)),((30),(2)),((40),(6)),((50),(3))))
                      |--Compute Scalar(DEFINE:([Expr1033]=[Expr1012]+(1)))
                           |--Sequence Project(DEFINE:([Expr1012]=row_number))
                                |--Compute Scalar(DEFINE:([Expr1037]=(1)))
                                     |--Segment
                                          |--Sort(ORDER BY:([Union1010] ASC))
                                               |--Constant Scan(VALUES:(((10),(1)),((20),(3)),((30),(2)),((40),(6)),((50),(3))))

The query works and uses the Merge Join. Note, however, that since value is an integer here, AVG performs integer division (30 / 4 = 7), so the output is √7 ≈ 2.6458 rather than the exact root mean square of the differences, √7.5 ≈ 2.7386.
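A minimal fix (a sketch, one of several ways to force floating-point arithmetic; it is not part of the original post) is to cast the difference before aggregating:

SELECT  SQRT(AVG(POWER(CAST(np.value - nn.value AS FLOAT), 2)))
FROM    nums np
JOIN    nums nn
ON      nn.rn = np.rn + 1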

Written by Quassnoi

October 13th, 2009 at 11:00 pm

Posted in SQL Server

Minimum and maximum on bit fields: SQL Server


From Stack Overflow:

I have this query in SQL Server 2005:

SELECT  *
FROM    tblJobs AS j
-- joining other tables here
WHERE   Closed = 0
AND Invoiced = 0
AND Active = 1
AND DepartmentId <> 2
AND dbo.fncIsAllPointsDelivered(JobID) = 1

This query is taking too long to run, and I know the problem is the UDF: dbo.fncIsAllPointsDelivered(J.JobID) = 1

The SQL for the UDF is here:

-- The function header below is reconstructed from the question's context
-- (only the body was posted); the int return type is an assumption.
CREATE FUNCTION dbo.fncIsAllPointsDelivered (@JobId int)
RETURNS int
AS
BEGIN
        DECLARE @DetailCount int
        DECLARE @TrackingCount int

        -- total number of load details for the job
        SELECT  @DetailCount = COUNT(*)
        FROM    tblLoadDetails
        WHERE   JobId = @JobId

        -- number of load details already delivered
        SELECT  @TrackingCount = COUNT(*)
        FROM    tblLoadDetails
        WHERE   JobId = @JobId
                AND Delivered = 1

        IF (@DetailCount = @TrackingCount AND @DetailCount > 0)
                RETURN 1

        RETURN 0
END

All of this runs blazingly fast unless the job has a large number of load details in it.

I am trying to think of a way to either make the UDF faster or get rid of the need for the UDF, but I am at a loss.

The UDF is definitely a problem here, since conditions based on UDFs are not sargable, that is, an index cannot be used to search for them.

Let's look closer into this UDF.

Given a @JobID, the function searches the load details for this job. For the function to succeed, all load details should have the state of Delivered, and there must be at least one load detail for this job.

The UDF can be rewritten using IN and NOT IN conditions. The first condition makes sure that there is at least one load detail for the given job, the second one ensures that all load details have state Delivered.
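Here is a sketch of such a rewrite, using the table and column names from the question and assuming that JobId and Delivered are non-nullable (an illustration, not necessarily the article's final query):

SELECT  j.*
FROM    tblJobs AS j
WHERE   j.Closed = 0
        AND j.Invoiced = 0
        AND j.Active = 1
        AND j.DepartmentId <> 2
        -- there is at least one load detail for the job
        AND j.JobID IN
        (
        SELECT  JobId
        FROM    tblLoadDetails
        )
        -- and no load detail for the job is left undelivered
        AND j.JobID NOT IN
        (
        SELECT  JobId
        FROM    tblLoadDetails
        WHERE   Delivered = 0
        )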

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

October 12th, 2009 at 11:00 pm

Posted in SQL Server

Date range with possible NULL values

with 2 comments

From Stack Overflow:

Using SQL Server 2008.

I have a stored proc which has start and end date as input parameters for date range.

Looking for a single SQL query which has a BETWEEN predicate on start and end date in the WHERE clause which can handle both cases where the dates are either both NULL or both have values.

This can be done by writing an IS NULL check in the expression, like this:

WHERE   (@startDate IS NULL OR date >= @startDate)
AND (@endDate IS NULL OR date <= @endDate)

or by using a COALESCE on the value of the date itself (so that the range always matches):

WHERE   date BETWEEN ISNULL(@startDate, date) AND ISNULL(@endDate, date)

However, neither of these conditions uses an index on date efficiently.
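A common workaround (a sketch with a hypothetical table t_data; not necessarily the technique developed later in the article) is to branch on the parameters inside the stored procedure, so that each branch gets its own sargable predicate and its own plan:

-- Hypothetical table t_data with an indexed date column.
IF @startDate IS NULL AND @endDate IS NULL
        SELECT  *
        FROM    t_data
ELSE IF @endDate IS NULL
        SELECT  *
        FROM    t_data
        WHERE   date >= @startDate
ELSE IF @startDate IS NULL
        SELECT  *
        FROM    t_data
        WHERE   date <= @endDate
ELSE
        SELECT  *
        FROM    t_data
        WHERE   date BETWEEN @startDate AND @endDate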

Let's create a sample table and see:
Read the rest of this entry »

Written by Quassnoi

October 5th, 2009 at 11:00 pm

Posted in SQL Server

Adjacency list vs. nested sets: SQL Server

with 5 comments

Continuing the series:

What is better for storing hierarchical data: the nested sets model or the adjacency list (parent-child) model?

For detailed explanations of the terms, see the first article in the series.

Now, let's see what's better for SQL Server.

We will create a single table that holds both adjacency list data and nested sets data, with 8 levels of nesting, 5 children of each parent node and 2,441,405 records:
Read the rest of this entry »

Written by Quassnoi

September 25th, 2009 at 11:00 pm

Posted in SQL Server

Finding sets within a time frame


From Stack Overflow:

I've got a set of data that contains users, their awards and the time they are granted.

When a user earns an award, the award and user is logged along with a date.

What I'd like is a query that finds a set of 3 achievements that were earned within 5 minutes of each other by the same user.

Any thoughts on how to accomplish this?

I'll go a step further and try to build a general solution to find sets of N values within a given time frame.

We should return all rows that are part of at least one 5-minute time frame containing at least N values. These time frames can overlap.

We should build such a time frame for each record, make sure that there are at least N values in this time frame, and, if so, return these values (applying DISTINCT to get rid of the duplicates).
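A sketch of this idea, assuming a hypothetical table t_award(userid, award, granted) and N = 3 (the article's own table and query follow behind the link):

SELECT  DISTINCT w.userid, w.award, w.granted
FROM    t_award a
CROSS APPLY
        (
        -- the 5-minute time frame opened by this row, for the same user
        SELECT  aw.userid, aw.award, aw.granted,
                COUNT(*) OVER () AS cnt
        FROM    t_award aw
        WHERE   aw.userid = a.userid
                AND aw.granted >= a.granted
                AND aw.granted < DATEADD(minute, 5, a.granted)
        ) w
WHERE   w.cnt >= 3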

Let's create a sample table and build a query to do this:

Read the rest of this entry »

Written by Quassnoi

September 22nd, 2009 at 11:00 pm

Posted in SQL Server

Rounding numbers preserving their sum


From Stack Overflow:

I'm looking for a way to round every value contained in a column without changing the total sum (which is known to be an integer, or very close to an integer because of floating-point precision).

Rounding every value to the nearest integer won't work (e.g. (1.5, 1.5) would be rounded to either (1, 1) or (2, 2)).

It's quite easy to do this using several queries (e.g. storing the original sum, rounding, computing the new sum, and updating as many rows as needed to get back to the original sum), but this is not a very elegant solution.

Is there a way to do this using a single SQL query?

I'm using SQL Server 2008, so solutions taking advantage of this specific vendor are welcome.

I'm looking for a query that minimizes the differences between the old values and the new ones. In other words, a value should never be rounded up if a greater value has been rounded down, and vice versa.

We have two restrictions here:

  1. The sum of the rounded numbers should be equal to that of the unrounded ones.
  2. When rounding, no number with a greater fractional part can be rounded down if there is a number with a smaller fractional part that had been rounded up.

We cannot round (1.4, 1.4, 1.2) to (1, 1, 2), since 1.4 would be rounded down while 1.2 (with a smaller fractional part) is rounded up. However, we can round them to (2, 1, 1) or (1, 2, 1), since nothing prevents us from rounding numbers with equal fractional parts differently.

Given this, we should build a query that would take a set of numbers and round them taking these two restrictions into account.

It is easily seen that to satisfy the second restriction a certain rounding threshold should exist: all numbers with a fractional part greater than this threshold are rounded up, those with a fractional part smaller than the threshold are rounded down, and those with a fractional part equal to the threshold can be rounded either way.

However, calculating this threshold is, first, quite a difficult task and, second, we still don't know what to do with the numbers having the fractional part equal to the threshold.

Instead, we use another approach.
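For reference, here is a sketch of one well-known technique for this kind of task, the largest-remainder method, with a hypothetical table t_value(id, value) whose values sum to an integer. It is shown for illustration only and is not necessarily the approach developed in the rest of the article.

-- Round every value down, then add 1 to the rows with the largest fractional
-- parts until the original sum is restored.
WITH    ranked AS
        (
        SELECT  id, value,
                FLOOR(value) AS rounded_down,
                ROW_NUMBER() OVER (ORDER BY value - FLOOR(value) DESC) AS rn
        FROM    t_value
        )
SELECT  id, value,
        rounded_down +
        CASE WHEN rn <= ROUND((SELECT SUM(value - rounded_down) FROM ranked), 0)
             THEN 1
             ELSE 0
        END AS rounded
FROM    ranked

Since the rows are ranked by fractional part, no value is rounded up while a value with a greater fractional part is rounded down, which satisfies the second restriction.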

Let's create a sample table and show this approach:

Read the rest of this entry »

Written by Quassnoi

September 21st, 2009 at 11:00 pm

Posted in SQL Server

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server

with 26 comments

This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.

Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
(
SELECT  value
FROM    t_right r
)

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_right r
WHERE   r.value = l.value
)

Differences between the methods

These methods are quite different.

First of all, LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These methods differ in how they handle NULL values in t_right.

LEFT JOIN is guaranteed to return every row from t_left, and filtering is then applied to the values returned from t_right. If for some row in t_left there is no corresponding row in t_right (meaning no row with that exact value is present in t_right), the row from t_left will be returned once, with NULL values substituted for t_right's actual values.

Since NULL values can never satisfy an equality JOIN condition, any NULL value returned by the query is guaranteed to have been substituted by the LEFT JOIN rather than fetched from an actual row of t_right. This means that LEFT JOIN / IS NULL returns each row from t_left at most once, and only when its value does not match any value in t_right.

The same holds for NOT EXISTS. Since it is a predicate rather than a JOIN condition, each row from t_left can be returned at most once as well. EXISTS always returns TRUE or FALSE: it returns TRUE as soon as it finds even a single matching row in t_right, and FALSE if it finds none.

NOT EXISTS, therefore, will return TRUE only if no row satisfying the equality condition is found in t_right (same as for LEFT JOIN / IS NULL).

Note that NULL values do not satisfy the equality condition, so both LEFT JOIN / IS NULL and NOT EXISTS will always return rows from t_left that have value set to NULL, even if there are rows with a NULL value in t_right.

NOT IN, however, behaves differently.

The IN predicate (unlike EXISTS) is trivalent, i.e. it can return TRUE, FALSE or NULL:

  • TRUE is returned when the non-NULL value in question is found in the list
  • FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
  • NULL is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

The IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.

This of course makes no difference when using the positive form of IN: predicates returning NULL are filtered out by the WHERE clause just like those returning FALSE.

However, NOT IN is different, since negation of NULL is NULL as well.

That's why a NOT IN condition will never hold for any list with a NULL value in it.

  • If the value is found in the list, IN will return TRUE, and NOT IN, therefore, will return FALSE
  • If the value is not found in the list, IN will return NULL, and NOT IN, in its turn, will also return NULL

Both conditions will of course be filtered out by the WHERE clause.

Let's illustrate it with two simple queries that compare (1, NULL) in t_left with (2, NULL) in t_right:

WITH    t_left AS
        (
        SELECT  1 AS value
        UNION ALL
        SELECT  NULL
        ),
        t_right AS
        (
        SELECT  2 AS value
        UNION ALL
        SELECT  NULL
        )
SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

value
1
NULL
2 rows fetched in 0.0001s (0.0006s)

This query, using NOT EXISTS, returns both values from t_left, since neither of them is equal to any of the values from t_right.

WITH    t_left AS
        (
        SELECT  1 AS value
        UNION ALL
        SELECT  NULL
        ),
        t_right AS
        (
        SELECT  2 AS value
        UNION ALL
        SELECT  NULL
        )
SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right
        )

value
0 rows fetched in 0.0001s (0.0005s)

This query, on the other hand, returns nothing. Since there is a NULL in t_right, NOT IN returns NULL rather than TRUE if the value is not found among the defined values. Just in case.

IN (and NOT IN) are too chicken to say something definite about lists with NULL unless they are completely sure that the value is there.
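For completeness, here is the same comparison written with LEFT JOIN / IS NULL (a sketch added for illustration; it is not part of the original post). Like the NOT EXISTS version, it returns both rows, since neither 1 nor NULL satisfies the equality condition against t_right:

WITH    t_left AS
        (
        SELECT  1 AS value
        UNION ALL
        SELECT  NULL
        ),
        t_right AS
        (
        SELECT  2 AS value
        UNION ALL
        SELECT  NULL
        )
SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL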

However, if the values in both tables are non-nullable, all three methods described above are semantically identical.

Read the rest of this entry »

Written by Quassnoi

September 15th, 2009 at 11:00 pm

Posted in SQL Server