Archive for the ‘SQL Server’ Category
SQL Server: joining NULL values
From Stack Overflow:
UPDATE n SET AddressID = a.AddressID FROM #NewAddress n JOIN dbo.Address a ON (a.[LineA] = n.[LineA] OR (a.[LineA] is null AND n.[LineA] is null)) AND (a.[LineB] = n.[LineB] OR (a.[LineB] is null AND n.[LineB] is null)) AND (a.[LineC] = n.[LineC] OR (a.[LineC] is null AND n.[LineC] is null)) WHERE n.Processed = 0Any ideas on how to UNION or EXISTS this query? This seems to be a pretty common join condition I’m encountering.
Here’s a JOIN condition: we should join two tables on a common key, also treating NULL values as equal. Normal equality conditions don’t work here: in SQL, NULL values are not considered equal (strictly speaking, their equality is a boolean NULL rather than true or false).
ANSI standard defines a special predicate for that: IS NOT DISTINCT FROM, which is supported by PostgreSQL. This treats values the same value as DISTINCT does: two equal values, as well as two NULL values, are considered not distinct. MySQL also supports similar operator, < =>. Unfortunately, SQL Server supports neither.
In SQL Server, we can emulate this by using set operators which distinctify the queries. Set operators work on sets, so we would need some kind of a SELECT query for them. This would also help us to overcome SQL Server’s inability to compare tuples directly. Basically, we need to tell if two tuples: (a.lineA, a.lineB, a.lineC) and (n.lineA, n.lineB, n.lineC) are distinct or not.
We can do that by selecting those tuples in dummy queries (those without a FROM clause), intersecting those queries (which would give us a one-record or zero-record resultset, depending on whether the two tuples are distinct or not) and applying EXISTS to the resultset. EXISTS is a boolean predicate and as such can serve as a join condition.
Let’s check it. We’ll create two tables, fill them with random values (including occasional NULLs here and there):
INSERT … SELECT … OUTPUT …: returning source table data
Comments enabled. I *really* need your comment
From Stack Overflow:
I am trying to insert into the target table values from the source table, but also insert the primary key of the source table into the tracking table.
INSERT INTO TRN_TEMP (TRN_TRAN_DATE, TRN_DESCRIPTION, TRN_AMT) OUTPUT D.DET_PRIMARY, GETDATE() INTO REGISTER (DET_PRIMARY_LINK, INS_DATE) SELECT D.DET_DATE, 'SOMETEXT', SUM(D.DET_NET) FROM DETAIL D LEFT JOIN REGISTER R ON R.DET_PRIMARY_LINK = D.DET_PRIMARY WHERE R.LINE_ID IS NULL -- TO REMOVE LINES ALREADY PROCESSED GROUP BY D.DET_DATEI can’t see a problem with the text above but I get an error:
The multi part identifier 'D.DET_PRIMARY' could not be bound.I’ve tried both
D.DET_DETAILandDETAIL.DET_DETAILand the error is the same.
This is a limitation of SQL Server which can be easily worked around by substituting MERGE instead of INSERT.
Let’s create some sample tables and see:
CREATE TABLE
source
(
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
value INT NOT NULL
)
CREATE TABLE
destination
(
id INT NOT NULL IDENTITY PRIMARY KEY,
grouper INT NOT NULL,
aggregated_value INT NOT NULL
)
CREATE TABLE
register
(
dest_id INT NOT NULL PRIMARY KEY,
max_source_id INT NOT NULL UNIQUE,
grouper INT NOT NULL UNIQUE,
regdate DATE NOT NULL,
)
INSERT
INTO source
VALUES
(1, 1, 5),
(2, 1, 2),
(3, 2, 6),
(4, 2, 8)
, and run the query:
MERGE
INTO destination d
USING (
SELECT grouper, SUM(value), MAX(id)
FROM source
WHERE grouper NOT IN
(
SELECT grouper
FROM register
)
GROUP BY
grouper
) s (grouper, aggregated_value, max_source_id)
ON (1 = 0)
WHEN NOT MATCHED THEN
INSERT (grouper, aggregated_value)
VALUES (grouper, aggregated_value)
OUTPUT INSERTED.id, s.max_source_id, s.grouper, GETDATE()
INTO register;
SELECT *
FROM register;
You can try it yourself on SQL Fiddle.
As you can see, we are using both destination table’s INSERTED.id, source table’s s.max_source_id, s.grouper and an arbitrary expression GETDATE() in the OUTPUT clause which works just alright.
An alert reader may notice that there is another problem with the original query, but this will be covered in the next post. Watch this space!
What’s UNPIVOT good for?
Answering questions asked on the site.
Karen asks:
… I’ve always thought
PIVOTandUNPIVOTare signs of a poorly designed database. I mean, is there a legitimate use for them if your model is OK?
I’ve made an actual use for them in a project I’ve been working on for the last several months (which is partly why there were no updates for so long!)
Part of the project is a task management system where each task has several persons related to it. There can be the creator of the task, the person the task is assigned to, the actual author of the task (on behalf of whom the task is created), and the task can be possible completed by a person or deleted by a person. A total of 5 fields related to persons.
Now, we need to take all tasks within a certain time range and list all people involved in them.
Let’s create a sample table and see how would we do that.
GROUP_CONCAT in SQL Server
Comments enabled. I *really* need your comment
I’m finally back from my vacation. Tunisia’s great: dates, Carthage, sea and stuff.
Now, to the questions.
Mahen asks:
Create a table called
Group:
Group id prodname 1 X 1 Y 1 Z 2 A 2 B 2 C The resultset should look like this:
id prodname 1 X,Y,Z 2 A,B,C Can you please help me to solve the above problem using a recursive CTE?
This is out good old friend, GROUP_CONCAT. It’s an aggregate function that returns all strings within a group, concatenated. It’s somewhat different from the other aggregate functions, because, first, dealing with the concatenated string can be quite a tedious task for the groups with lots of records (large strings tend to overflow), and, second, the result depends on the order of the arguments (which is normally not the case for the aggregate functions). It’s not a part of a standard SQL and as for now is implemented only by MySQL with some extra vendor-specific keywords (like ORDER BY within the argument list).
This functionality, however, is often asked for and I have written some articles about implementing this in PostgreSQL and Oracle.
Now, let’s see how to do it in SQL Server.
Usually, SQL Server‘s FOR XML clause is exploited to concatenate the strings. To do this, we obtain a list of group identifiers and for each group, retrieve all it’s product names with a subquery appended with FOR XML PATH(''). This makes a single XML column out of the recordset:
Read the rest of this entry »
SQL Server: deleting with self-referential FOREIGN KEY
Comments enabled. I *really* need your comment
From Stack Overflow:
I have an SQL Server table defined as below:
TestComposite id (PK) siteUrl (PK) name parentId 1 site1 Item1 NULL 2 site1 Item2 NULL 3 site1 Folder1 NULL 4 site1 Folder1.Item1 3 5 site1 Folder1.Item2 3 6 site1 Folder1.Folder1 3 7 site1 Folder1.Folder1.Item1 6 Items and folders are stored inside the same table
If an item is inside a folder, the
parentIDcolumn is theidof the folder.I would like to be able to
DELETE CASCADEitems/folders when I delete a folder.I tried to define a constraint similar to:
ALTER TABLE [TestComposite] ADD CONSTRAINT fk_parentid FOREIGN KEY (ParentID, SiteUrl) REFERENCES [TestComposite] (ID, SiteUrl) ON DELETE CASCADE, but it gives me this error:
Introducing FOREIGN KEY constraint 'fk_parentid' on table 'TestComposite' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
SQL Server does support chained CASCADE updates, but does not allow one table to participate more that once in a chain (i. e. does not allow loops).
SQL Server, unlike most other engines, optimizes cascading DML operations to be set-based which requires building a cycle-free DML order (which you can observe in the execution plan). With the loops, that would not be possible.
However, it is possible to define such a constraint without cascading operations, and with a little effort it is possible to delete a whole tree branch at once.
Let’s create a sample table:
Read the rest of this entry »
Efficient circle distance testing
Comments enabled. I *really* need your comment
Answering questions asked on the site.
eptil asks:
I am using SQL Server 2008, but not the spatial features.
I have a table with few entries, only 40,000. There is an
id INT PRIMARY KEYcolumn and two columns storing a 2d coordinate, both decimals.I would like to find all the records that do not have other records within a given radius. The query I am using at the moment is:
SELECT id, x, y FROM mytable t1 WHERE ( SELECT COUNT(*) FROM mytable t2 WHERE ABS(t1.x - t2.x) < 25 AND ABS(t1.y - t2.y) < 25 ) = 1
This is taking 15 minutes to run at times.Is there a better way?
Of course using spatial abilities would be a better way, but it is possible to make do with plain SQL. This will also work in SQL Server 2005.
In most database engines, the spatial indexes are implemented as the R-Tree structures. SQL Server, however, uses another approach: surface tesselation.
Basically, it divides the surface into a finite number of tiles, each assigned with a unique number. The identifiers of tiles covered by the object are stored as keys of a plain B-Tree index.
When SQL Server‘s optimizer sees a geometrical predicate against an indexed column, it calculates the numbers of tiles that possibly can satisfy this predicate. Say, if the tiles are defined as squares with side 1, the predicate column.STDistance(@mypoint) < 2 can only be satisfied by the objects within 2 tiles away from @mypoint‘s tile. This gives a square of 25 tiles with @mypoint‘s tile in the center. The tile numbers can be found and searched for using the index. Exact filtering condition is then applied to each candidate value returned by the index.
Same solution can be used in our case even without the spatial functions. Comparing tile numbers is an equijoin and hash join method is eligible for this operation. We can even choose the tiling algorithm individually for each query, since we don’t have to store the tile identifiers in the table, and the hash table will be built dynamically anyway.
Let’s create a sample table and see how it works:
Read the rest of this entry »
SQL Server: EXCEPT ALL
Comments enabled. I *really* need your comment
Answering questions asked on the site.
myst asks:
I have two really large tables with lots of columns, many of them are nullable
I need to remove all rows from
table1which are not present intable2, but there can be duplicates in both tables and writing more than 70IS NULLconditions would be a pain, and I want to make sure there’s nothing I’m missing.Is there a more simple way?
SQL Server supports EXCEPT clause which returns all records present in the first table and absent in the second one. But this clause eliminates duplicates and cannot be used as a subject to a DML operation.
ANSI SQL standard describes EXCEPT ALL which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, SQL Server does not support this operator.
Similar behavior can be achieved using NOT IN or NOT EXISTS constructs. But in SQL Server, IN predicate does not accept more than one field. NOT EXISTS accepts any number of correlated columns, but it requires extra checks in the WHERE clause, since equality operator does not treat two NULL values as equal. Each pair or nullable columns should be additionally checked for a NULL in both fields. This can only be done using OR predicates or COALESCE, neither of which adds to performance.
But there is a way to emulate EXCEPT ALL in SQL Server quite elegantly and efficiently.
Let’s create two sample tables:
Read the rest of this entry »
SQL Server: running totals
From Stack Overflow:
We have a table of transactions which is structured like the following :
transactions TranxID ItemID TranxDate TranxAmt
TranxAmtcan be positive or negative, so the running total of this field (for anyItemID) will go up and down as time goes by.Getting the current total is obviously simple, but what I’m after is a performant way of getting the highest value of the running total and the
TranxDatewhen this occurred.Note that TranxDate is not unique.
SQL Server is a very nice system, but using it for calculating running totals is a pain.
Oracle supports additional clauses for analytic functions, RANGE and ROWS, which define the boundaries of the function’s windows and hence can be used to implement running totals. By default, it is just enough to omit the RANGE clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.
SQL Server‘s support for window functions only extends aggregate capabilities a little so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM and COUNT it is impossible to control the window boundaries and the records order. Such analytic functions can not be used to calculate running totals.
The common way to write such a running total query is using a subquery or a self join which would count the SUM of all previous records. However, the complexity of this query is O(n^2) and it’s not usable for any real volumes of data.
This is one of the few cases when the cursors are faster than a set-based solution described above. But we all are aware of the drawbacks of cursors and better search for something else.
This task, fortunately, is a little more simple than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.
Let’s create a sample table:
Read the rest of this entry »
Calculating mode
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query:
WITH CTE AS ( SELECT e_id, scale, ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER(PARTITION BY e_id) AS cn FROM ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, AVG(scale) AS [AVG], STDEV(scale) AS [StdDev], AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian] MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian] FROM CTE GROUP BY e_idHow do I add Mode to this query?
A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.
In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.
Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.
This means that we should define a way on how to choose this mode.
There can be three approaches to this:
- Return every modal value
- Return a single modal value
- Return an aggregate of all modal values
To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »
Grouping continuous ranges
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a ticketing system.
Now I have to select adjacent places when the user asks for 2 or 3 tickets.
Every ticket has a line and column number. The concept of adjacent places is places in the same line with adjacent columns numbers.
These tickets are in an SQL Server database.
Any ideas about this algorithm to search for available adjacent seats?
This is a problem known as grouping continuous ranges
: finding the continuous ranges of the records (in a certain order) having the same value of the grouping column.
For the table in question, the groups will look like this:
| Row | Column | Occupied | Group |
|---|---|---|---|
| 1 | 1 | 1 | - |
| 1 | 2 | 0 | 1 |
| 1 | 3 | 0 | 1 |
| 1 | 4 | 0 | 1 |
| 1 | 5 | 1 | - |
| 1 | 6 | 1 | - |
| 1 | 7 | 0 | 2 |
| 1 | 8 | 0 | 2 |
| 1 | 9 | 1 | - |
| 1 | 10 | 0 | 3 |
To find the spans of free seats with the required length, we need to group the continuous ranges of rows with occupied = 0 in column order (within the rows), calculate the counts of records in these groups and return the groups having sufficient value of COUNT(*).
To group something in SQL, we need to have an expression that would have the same value for all records belonging to the group.
How do we build such an expression for the continuous ranges?
In a movie theater, seats are numbered sequentially, so we can rely on the values of the column being continuous. Now, let’s calculate only the records that describe the free seats, and build their row numbers in the column order:
| Row | Column | Occupied | ROW_NUMBER |
|---|---|---|---|
| 1 | 2 | 0 | 1 |
| 1 | 3 | 0 | 2 |
| 1 | 4 | 0 | 3 |
| 1 | 7 | 0 | 4 |
| 1 | 8 | 0 | 5 |
| 1 | 10 | 0 | 6 |
We see that filtering only free seats broke the column numbering order (the columns numbers are not continuous anymore), but the ROW_NUMBERs are continuous (by definition). Each occupied seat breaks the continuity of the columns, and the spans of the free seats correspond to the ranges of columns with unbroken continuity.
If the continuity is not broken, column and ROW_NUMBER both increase by 1 with each record. This means that their difference will be constant:
| Row | Column | Occupied | ROW_NUMBER | column – ROW_NUMBER |
|---|---|---|---|---|
| 1 | 2 | 0 | 1 | 1 |
| 1 | 3 | 0 | 2 | 1 |
| 1 | 4 | 0 | 3 | 1 |
| 1 | 7 | 0 | 4 | 3 |
| 1 | 8 | 0 | 5 | 3 |
| 1 | 10 | 0 | 6 | 4 |
The value of difference between column and ROW_NUMBER, therefore, uniquely defines the continuous group the records belong to.
This means we can group on it and do other things (the rest being pure technical).
Let’s create a sample table:
Read the rest of this entry »
Subscribe in a reader