Archive for the ‘SQL Server’ Category
Included columns
Answering questions asked on the site.
Peter asks:
I’ve been looking around for explanation of how
INCLUDEDcolumns in SQL Server indexes work and found myself totally confused.Could you please explain what they are for?
Thanks!
In a couple of words, an INCLUDED column is a column which is included into the index payload but not the key, i. e. the index leaves are not sorted on this column.
This can save some DML overhead on columns which are filtered on non-ranged conditions.
If the index leaves are less in size than the table rows, it can be more efficient to get the column value from the index rather than from the table to filter on it.
This can help to get rid of expensive Key Lookups or Clustered Index Scans.
In one of applications I developed, I had a large table that kept the restaurant orders.
The table was very large both in row count and the row size: a little less than a hundred columns.
Among other fields there was a DATETIME column (the order date) and an INT column (the client id)
One of the most used reports required that I calculated the average price of the orders for a certain set of clients within a certain period, say, for June.
For the sake of brevity, I’ll replace all these numerous columns with a single NVARCHAR(4000) column.
This will make the table definition more brief and comprehensible, leaving the table rows as large.
Here’s what the table looked like:
Read the rest of this entry »
Building permutations: SQL Server
From Stack Overflow:
I am doing a testing scenario.
There are 6 people in each
site:
Site People Site 1 A, B, C, D, E, F Site 2 G, H, I, J, K, L Site 3 M, N, O, P, Q, R Site 4 S, T, U, V, W, X I want to write a query that can suggests me datewise the people who can test the site, two at a time.
Here are the rules:
- If a person has tested his site on Day 1, then his turn should come on Day 4 and not before that.
This, however, is applicable for current week. So if A and D test a site on 22nd, B and E test it on 23rd and C and F test on 24th, then for this week, A and D can test the site only on 25th. Sunday is a holiday.- Every week, the pair should change. Rule 1 is still applicable with the new pair.
- A person belonging to a particular site cannot test other sites.
To build these rules, we need to do the following:
- For each site, split 6 people into 3 pairs.
- Assign each pair to the appropriate day of week (pair 1 for Monday and Thursday, pair 2 for Tuesday and Friday, pair 3 for Wednesday and Saturday)
- For each new week, split the 6 people new way so that no pairs persist.
Step 3 is most interesting here, since it requires building permutations.
To build the pairs, we need to arrange 6 people and take the pairs from the arrangements.
There are 6! = 720 distinct ways to arrange 6 people.
But for our task the following pairs:
(1, 2)and(2, 1)(3, 4)and(4, 3)(5, 6)and(6, 5)
are considered equal. That’s why we can use only 720 / 8 = 90 arrangements which yield non-repeating pairs.
SQL Server does not provide a native way to build all permutations of a set. We will have to emulate it using recursive CTE‘s.
Let’s create a table with required data:
Read the rest of this entry »
IN with a comma separated list: SQL Server
From Stack Overflow:
How do I search a table for all values from a comma separated list, or return all values if the list contains a single 0?
Using comma separated lists in SQL Server as a search parameter is generally considered a bad design and should be avoided.
One should pass lists in a table variable rather than in a set of comma separated values.
However, there are legacy applications, ORM‘s and these kinds of stuff that just leave you no choice.
MySQL provides a nice function find_in_set designed exactly for this task.
SQL Server does not directly support an IN predicate for CSV, neither it provides find_in_set. It needs a rowset for IN predicate to work.
To expand a comma separated list into a rowset, we will use a recursive CTE.
In the CTE, we should search for the position of every next comma and return it.
Then we can just take the substring between two commas and cast it into an INT:
DECLARE @lid VARCHAR(100)
SET @lid = '3, 14, 15, 296, 5358';
WITH cd AS
(
SELECT 1 AS first, CHARINDEX(',', @lid, 1) AS next
UNION ALL
SELECT next + 1, CHARINDEX(',', @lid, next + 1)
FROM cd
WHERE next > 0
),
lid AS
(
SELECT CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first)AS INT) AS id
FROM cd
)
SELECT *
FROM lid
| id |
|---|
| 3 |
| 14 |
| 15 |
| 296 |
| 5358 |
Now we just need to return the values from the table found in this list, or all values if there is a single 0 in the list.
Let’s create a sample table and build a query to do this:
Read the rest of this entry »
Indexing tiny tables
From Stack Overflow:
Hypothetically, in a SQL Server database, if I have a table with two
INTfields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the twoINTfields overcomes the overhead imposed by said indexes?
Very nice question.
Imagine we have a table table1 (col INT NOT NULL, value1 INT NOT NULL, value2 VARCHAR(100) NOT NULL, …) with an index on col.
In SQL Server, an index can improve the following things.
-
Index lookups (range scans). An index can be used to limit the number of rows scanned:
SELECT * FROM table1 t1 WHERE t1.col BETWEEN @start AND @end
- Secondary data source. This query will get all data it need from the index:
SELECT t1.col FROM table1 t1
- Ordered rowsets. This query will not need to sort:
SELECT * FROM table1 t1 ORDER BY t1.col - Efficient
MIN/MAX:SELECT MIN(col) FROM table1 t1
The latter case is probably the best illustration of the fact that even a table of 2 rows can benefit from creating an index.
Let’s create the sample tables:
Read the rest of this entry »
Efficient EXISTS
In my previous article:
I compared efficiency of different methods to check for existence of a value in a subquery resultset.
These methods included IN and EXISTS predicates.
It was shown that in SQL Server, a poorly written EXISTS can be very slow, while an IN predicate is almost always efficiently optimized using one of the numerous SEMI JOIN methods.
However, a properly written EXISTS predicate can outperform IN.
Let’s create a sample table and see how:
Read the rest of this entry »
IN vs. JOIN vs. EXISTS
From Stack Overflow:
I’m wanting to select rows in a table where the primary key is in another table.
I’m not sure if I should use a
JOINor theINoperator in SQL Server 2005:SELECT * FROM a WHERE a.c IN ( SELECT d FROM b )SELECT a.* FROM a JOIN b ON a.c = b.dIs there any significant performance difference between these two SQL queries with a large dataset (i. e. millions of rows)?
Before we begin I should note that the queries provided by the author are not similar in fact and could produce different resultsets.
If the table used in the subquery returns a value twice, a JOIN will also return the matching rows twice, while an IN condition will return them only once.
That’s why it’s more correct to compare to the following query:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
Now, à nos moutons.
The folk wisdom advises against using IN constructs in favor of JOIN‘s because they say that the subquery will be executed many times in a loop
.
Let’s create some sample tables and see if it’s really so:
Read the rest of this entry »
Group based identity
From Stack Overflow:
I’m working with an application that adds a new row to the database, based on the last row meeting a certain criteria.
Here is an over-simplified visualization:
A1 A2 A3 B1 B2 Using the visualization above, a web page loads up the highest B value, which is 2.
Then, after some time, it wants to insert B3, the next record in the series.
However, it has to check to make sure that someone else didn’t do the same thing.
In actuality, each row corresponds to a place that a piece of equipment visited in a factory.
The page (not easy to rewrite), currently moves the equipment to the next step.
Is there a standard pattern for dealing with this type of problem, or do I simply need to lock the table?
This question combines two other frequently asked questions:
- How do I make group-based identity (i. e. keep several identity series identified by the value of a certain column)
- How do I ensure there will be no identity gaps?
While there are numerious solutions that involve some kind of selecting MAX(column) and updating the row inside a transaction, there is a more simple way using linked lists.
We need to create the table as following:
Read the rest of this entry »
Flattening timespans: SQL Server
From Stack Overflow:
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan:
date start stop 2009.06.03 05:54:48:000 10:00:13:000 2009.06.03 09:26:45:000 09:59:40:000 The resulting timespan would be from 05:54:48 to 10:00:13.
Since there’s a gap between 10:00:13 and 10:12:50, we also have the following timespans:
date start stop 2009.06.03 10:12:50:000 10:27:25:000 2009.06.03 10:13:12:000 11:14:56:000 2009.06.03 10:27:25:000 10:27:31:000 2009.06.03 10:27:39:000 13:53:38:000 2009.06.03 11:14:56:000 11:15:03:000 2009.06.03 11:15:30:000 14:02:14:000 2009.06.03 13:53:38:000 13:53:43:000 2009.06.03 14:02:14:000 14:02:31:000 which result in one merged timespan from 10:12:50 to 14:02:31, since they’re overlapping or adjacent.
Any solution, be it SQL or not, is appreciated.
This is also quite a common task.
The algorithm here is simple:
- Order all timespans by
start - Initialize a variable which will hold a running maximum of the timespans’
stop - Take the timespans one by one, and for each next timespan, check its
startagainst a running maximum of the previousstop‘s. If the currentstartis greater than the maximum of previousstop‘s, it means that all timespans that started earlier have been ended, and we have a gap - Update the running maximum
Unfortunately, SQL Server does not provide a simple way to calculate running maximums. We could use a correlated subquery of course, but it would take forever.
This is one of quite rare cases when cursor in SQL Server work faster than set-based operations.
Let’s create a sample table and see how it works:
Read the rest of this entry »
Joining unjoinable
From Stack Overflow:
I want to combine two tables into one.
Let say I have:
Table1:
ID Name 1 A 2 B 3 C Table2:
ID Name 4 D 5 E 6 F I want to make Table3:
Name1 Name2 A D B E C F How can I do this in SQL Server?
This a certainly a task for a JOIN, since we have two tables here and want to match any row from the first table with another row from the second table.
But what condition do we use for a JOIN?
The ID‘s of the rows seem to be arbitrary and have no special meaning. They could as well be sparse and randomly distributed, so it’s impossible to build a formula which corresponds one ID to another. Same is true for the names.
However, there actually is a condition which we could JOIN on.
Since the author mentioned there is same number of rows in both tables, we could use the ROW_NUMBER() as a JOIN condition.
We just need to wrap both these queries into CTE‘s which select a ROW_NUMBER() along with the table data:
WITH table1 AS
(
SELECT 1 AS id, 'A' AS name
UNION ALL
SELECT 2 AS id, 'B' AS name
UNION ALL
SELECT 3 AS id, 'C' AS name
),
table2 AS
(
SELECT 4 AS id, 'D' AS name
UNION ALL
SELECT 5 AS id, 'E' AS name
UNION ALL
SELECT 6 AS id, 'F' AS name
),
q1 AS
(
SELECT table1.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
FROM table1
),
q2 AS
(
SELECT table2.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
FROM table2
)
SELECT q1.name, q2.name
FROM q1
JOIN q2
ON q2.rn = q1.rn
| name | name | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | D | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| B | E | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| C | F | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 rows fetched in 0.0002s (0.0264s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 4 ms.
Longest common prefix: SQL Server
From Stack Overflow:
I have some data:
id ref 1 3536757616 1 3536757617 1 3536757618 2 3536757628 2 3536757629 2 3536757630 and want to get the result like this:
id result 1 3536757616/7/8 2 3536757629/28/30 Essentially, the data should be aggregated on
id, and theref‘s should be concatenated together and separated by a/(slash), but with longest common prefix removed.
I’ve already encoutered this problem several times, so I’ll try to cover solutions for all RDBMS‘s my blog is about:
- Longest common prefix: SQL Server
- Longest common prefix: PostgreSQL
- Longest common prefix: Oracle
- Longest common prefix: MySQL
I hope this will be interesting as approaches will differ significantly for all four systems.
Today is SQL Server time.
I won’t create sample tables here, since I’m demonstrating the principle. Instead, I’ll just use dynamically generated data.
What do we need to do here is:
- Find least common prefix for each group and its length
- Cut off the prefix of each but the first
ref, usingSUBSTRING - Concatenate the strings using
FOR XML
Steps 2 and 3 are quite simple, but the first one needs some effort.
SQL Server doesn’t provide a function to find the longest common prefix (LCP), so we’ll have to implement it.
With some limitations, it may be done using pure SQL, no UDF‘s.
Read the rest of this entry »
Subscribe in a reader