Archive for the ‘SQL Server’ Category
SQL Server: deleting adjacent values
From Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:
SELECT *
FROM mytable qi
WHERE EXISTS
(
SELECT NULL
FROM mytable t
WHERE t.unitid = qi.unitid
AND t.day = qi.day
AND t.interval = qi.interval - 1
)
However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.
If we just delete everything, the query above will not return any records just as well, but it’s not our goal.
The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.
So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).
In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.
However, we can formulate the delete condition.
We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.
The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).
The second record of any range should be deleted, since it’s adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).
The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.
In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.
The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.
It is best done by iterating the records up from the current and returning the id of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1). This can be done in a subquery right in the PARTITION BY clause of the ROW_NUMBER function.
Let’s create a sample table and compose the query:
SQL Server: random records avoiding CTE reevaluation
From Stack Overflow:
I need to get 5 random records from a table plus a further record based on data from the users preferences.
How can I combine the two statements whilst ensuring that no results are repeated (i. e. the favourite is not present in the 5 random records)?
We should select the 6th record so that it would satisfy two conditions: first, it should be preferred by user, second, it should not be among the 5 random records selected earlier.
The problem with random records it that they are, um, random. We usually cannot reproduce 5 random records to apply further conditions on them.
That’s where the CTEs would come handy, but there’s a little problem.
A CTE, as Microsoft understands them, is just an alias for an online view.
Therefore, each time a CTE is used in a query, it may (or may not) be reevaluated.
It’s not a problem for a plain CTEs which are built out of static columns.
However, this is totally a problem when we try to UNION two queries using an undeterministic CTE (like, ordered by NEWID()), as we would do in this case.
Let’s create a sample table and see what happens:
Dynamic pivot
From Stack Overflow:
I have a table and want to transpose its rows to columns, similar to a pivot table but without summarising.
For example I have the following tables:
Question (QuestionID, QuestionText) Response (ResponseID, ResponseText, QuestionID)Basically I want to be able to create a dynamic table something like:
Question 1 Question 2 Question 3 Response 1.1 Response 1.2 Response 1.3 Response 2.1 Response 2.2 Response 2.3 Response 3.1 Response 3.2 Response 3.3 Response 4.1 Response 4.2 Response 4.3 The main requirement would be I don’t know at design time what the question text will be.
First, let’s put it straight: SQL is not a right tool to do this.
SQL operates on sets. An SQL query is a functions that returns a set. And the column layout is an implicit property of a set. It should be defined in design time. This is the way SQL works.
However, there are several approaches that can simplify the task. One of them is using SQL Server‘s XML abilities.
To arrange the answers in a table, we need to make the following:
- Select all questions. This rowset will define the columns of our table.
- Select the row numbers for all answers for the question that has the most answers. This rowset will define the rows of our table. The number of rows wil be exactly as the number of answers to the most answered question.
- Assign the ordinal number to each answer on any question. This is done using
ROW_NUMBER(). This value will uniquely define the position of the answer in a table. - For each row (taken from the rowset made on step 2), take all the questions and
LEFT JOINthe answers that need to be put into this row, using the ordinal number calculated on the step 3 as aLEFT JOINcondition. This will put either the answer or an empty string into the corresponding cell. - Transform the rowset retrieved on step above into an XML string made out of
<td>‘s and enclosed into a<tr> - Transform the resulting rowset into an XML made out of the XML‘s from the previous step and enclosed into a
<table>
Let’s create a sample table and see how can we do this:
INNER JOIN vs. CROSS APPLY
From Stack Overflow:
Can anyone give me a good example of when
CROSS APPLYmakes a difference in those cases whereINNER JOINwill work as well?
This is of course SQL Server.
A quick reminder on the terms.
INNER JOIN is the most used construct in SQL: it joins two tables together, selecting only those row combinations for which a JOIN condition is true.
This query:
SELECT * FROM table1 JOIN table2 ON table2.b = table1.a
reads:
For each row from
table1, select all rows fromtable2where the value of fieldbis equal to that of fielda
Note that this condition can be rewritten as this:
SELECT * FROM table1, table2 WHERE table2.b = table1.a
, in which case it reads as following:
Make a set of all possible combinations of rows from
table1andtable2and of this set select only those rows where the value of fieldbis equal to that of fielda
These conditions are worded differently, but they yield the same result and database systems are aware of that. Usually both these queries are optimized to use the same execution plan.
The former syntax is called ANSI syntax, and it is generally considered more readable and is recommended to use.
However, it didn’t get into Oracle until recently, that’s why there are many hardcore Oracle developers that are just used to the latter syntax.
Actually, it’s a matter of taste.
To use JOINs (with whatever syntax), both sets you are joining must be self-sufficient, i. e. the sets should not depend on each other. You can query both sets without ever knowing the contents on another set.
But for some tasks the sets are not self-sufficient. For instance, let’s consider the following query:
We table
table1andtable2.table1has acolumncalledrowcount.For each row from
table1we need to select firstrowcountrows fromtable2, ordered bytable2.id
We cannot formulate a join condition here. The join condition, should it exists, would involve the row number, which is not present in table2, and there is no way to calculate a row number only from the values of columns of any given row in table2.
That’s where the CROSS APPLY can be used.
CROSS APPLY is a Microsoft’s extension to SQL, which was originally intended to be used with table-valued functions (TVF‘s).
The query above would look like this:
SELECT *
FROM table1
CROSS APPLY
(
SELECT TOP (table1.rowcount) *
FROM table2
ORDER BY
id
) t2
For each from
table1, select firsttable1.rowcountrows fromtable2ordered byid
The sets here are not self-sufficient: the query uses values from table1 to define the second set, not to JOIN with it.
The exact contents of t2 are not known until the corresponding row from table1 is selected.
I previously said that there is no way to join these two sets, which is true as long as we consider the sets as is. However, we can change the second set a little so that we get an addicional calculated field we can later join on.
The first option to do that is just count all preceding rows in a subquery:
SELECT *
FROM table1 t1
JOIN (
SELECT t2o.*,
(
SELECT COUNT(*)
FROM table2 t2i
WHERE t2i.id < = t2o.id
) AS rn
FROM table2 t2o
) t2
ON t1.rowcount = t2.rn
The second option is to use a window function, also available in SQL Server since version 2005:
SELECT *
FROM table1 t1
JOIN (
SELECT t2o.*, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM table2 t2o
) t2
ON t1.rowcount = t2.rn
This functions returns the ordinal number a row would have be the ORDER BY condition used in the function applied to the whole query.
This is essentially the same result as the subquery used in the previous query.
Now, let’s create the sample tables and check all these solutions for efficiency:
SQL Server: aggregate bitwise OR
From Stack Overflow:
I am creating a script for merging and deleting duplicate rows from a table.
The table contains address information, and uses an integer field for storing information about the email as bit flags (column name
value). For example, if bit 1 is set invalue, that means the record is a primary address.There are instances of the same email being entered twice, but sometimes with different
values. To resolve this, I need to take thevaluefrom all duplicates, assign them to one surviving record and delete the rest.My biggest headache so far as been with the merging of the records. What I want to do is bitwise
ORallvalues of duplicate records together.
From database theory’s point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.
However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.
We have three problems here:
- Select a first record for each set of duplicates
- Update this record with bitwise
ORof all values in its set - Delete all other records
Step 1 is easy to do using ROW_NUMBER().
Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.
See this article I wrote some time ago on how to do this:
Now, the main problem is step 2.
SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.
The main idea here is that for bit values, aggregate OR and AND can be replaced with MAX and MIN, accordingly.
All we need is to split each value into the bits, aggregate each bit and merge the results together.
Let’s create a sample table:
Read the rest of this entry »
Selecting compatible articles
From Stack Overflow:
I need to formulate an SQL query that returns all articles that are compatible to a set of other articles (of arbitrary size).
So for a list of article numbers
A, B,… , Nthe question is:Give me all articles that are compatible with
A and B and … and NFor example, consider the table
A B 1 2 3 1 3 4 If I wanted all articles that are compatible with 1, the query would return (2, 3).
The query generated by the list (2, 3) will return 1, whilst the query generated from the list (1, 3) generates an empty list.
This table describes a friendship: a symmetric irreflexive binary relation.
That is:
- For any given
a,b, ifais a friend tob, thenbis a friend toa - For any given
a,ais never a friend to itself
This relation is heavily used by social networks.
A normalized table describing this relation should be defined like this:
CREATE TABLE t_set (
a INT NOT NULL,
b INT NOT NULL
)
ALTER TABLE t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b)
ALTER TABLE t_set ADD CONSTRAINT ck_set_ab CHECK (a < b)
, the check being added to account for the relation symmetry.
Complete relation can be retrieved with the following query:
Read the rest of this entry »
Selecting birthdays
Answering questions asked on the site.
James asks:
I’m developing a forum and want to select all users that have a birthday within the next 3 days.
How do I do it?
This is in SQL Server
This is a very nice feature which every decent forum should have, and I’ll be glad to answer this question.
Unfortunately you didn’t provide the names of your tables, so I’ll have to make them up.
It’s not much of a stretch to assume that your table is called t_user and you keep the user’s birthdates in a DATETIME field called birthdate.
A birthday within next 3 days means that if you add the person’s age to the person’s birthdate, you get a date between the current date and three days after it.
To check this, we just need to calculate the number of months between the dates and make sure that it’s divisible by 12 with reminder of 0 or 11 (to handle month transitions).
Then we need to add a transition month and divide the number of months by 12. The quotient will give us the number of years we need to add to the birthdate to compare the result with GETDATE().
Let’s create a sample table and see how to do it:
Read the rest of this entry »
Finding duplicates on either of the fields
From Stack Overflow:
This query for creating a list of сandidate duplicates is easy enough:
SELECT COUNT(*), Can_FName, Can_HPhone, Can_EMail FROM can GROUP BY Can_FName, Can_HPhone, Can_EMail HAVING COUNT(*) > 1But if the actual rule I want to check against is
FNameand(HPhone OR Email)— how can I adjust theGROUP BYto work with this?
This is quite a common need.
The problem with conditions like that is that GROUP BY will not work on them since they are not transitive.
Let’s imagine a sample dataset:
| id | name | phone | |
|---|---|---|---|
| 1 | John | 555-00-00 | john@example.com |
| 2 | John | 555-00-01 | john@example.com |
| 3 | John | 555-00-01 | john-other@example.com |
| 4 | James | 555-00-00 | james@example.com |
| 5 | James | 555-00-01 | james-other@example.com |
We see that records 1 and 2 are duplicates
(I’d better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.
But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn’t be considered similar if not for the record 2 that binds them.
Given this, we should use similarity chains instead of GROUP BY.
A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).
Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.
This set can be identified by the id of the least node.
We can solve this tasks using recursive CTE‘s. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.
Let’s create a sample table:
Read the rest of this entry »
Searching for NULL columns
From Stack Overflow:
I want to make sure I’m not inserting a duplicate row into my table (i. e. only
PRIMARY KEYdifferent).All my fields allow
NULL‘s as I’ve decidedNULLto meanall values.Because of
NULL‘s, the following statement in my stored procedure can’t work:IF EXISTS ( SELECT * FROM MY_TABLE WHERE MY_FIELD1 = @IN_MY_FIELD1 AND MY_FIELD2 = @IN_MY_FIELD2 AND MY_FIELD3 = @IN_MY_FIELD3 AND MY_FIELD4 = @IN_MY_FIELD4 AND MY_FIELD5 = @IN_MY_FIELD5 AND MY_FIELD6 = @IN_MY_FIELD6 ) BEGIN goto on_duplicate END, since
NULL = NULLis not true.How can I check for the duplicates without having an
IF ISNULLstatement for every column?
First of all, can’t help being a smartass and tell that a UNIQUE index should be defined over these columns.
SQL Server does index NULL values and they are first class values as long as uniqueness is concerned, so creating this index will manage the duplicates automatically.
Now, back to business. Of course there are cases when creating such an index is not acceptable.
In these cases a very nice but little known operator INTERSECT comes handy.
This operator takes two resultsets with the same layout and finds rows common to them (eliminatnig duplicates), as SQL Server‘s understands common: each value in each column should be either equal to one in the other row or they both should be NULL‘s.
Let’s create a sample table:
Read the rest of this entry »
Overlapping ranges: SQL Server
From Stack Overflow:
I have an event table that specifies a date range with
start_dateandend_datefields.I have another date range, specified in code, that defines the current week as
week_startandweek_end.I’d like to query all events for the week.
This query needs to find all events that overlap the week.
The condition for overlapping ranges is well known and quite simple:
start_date < @week_end AND end_date > @week_start
However, in SQL more simple
doesn’t always mean more efficient
, and the following condition, despite being more complex, can yield better performance:
(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
Let’s create a sample table and see how so:
Read the rest of this entry »
Subscribe in a reader