Archive for the ‘SQL Server’ Category
Rounding numbers preserving their sum
From Stack Overflow:
I’m looking for a way to round every value contained in a column without changing the total sum (which is known to be an integer, or very close to an integer because of float number precision).
Rounding every value to the nearest integer won’t work (ex : (1.5, 1.5) will be rounded to either (1, 1) or (2, 2))
It’s quite easy to do this using several requests (e. g. storing the original sum, rounding, computing the new sum, and updating as many rows as needed to go back to the original sum), but this is not a very elegant solution.
Is there a way to do this using a single SQL request?
I’m using SQL Server 2008, so solutions taking advantage of this specific vendor are welcome.
I’m looking for a request minimizing the differences between the old values and the new ones. In other words, a value should never be rounded up if a greater value has been rounded down, and vice-versa
We have two restrictions here:
- The sum of the rounded numbers should be equal to that of the unrounded ones.
- When rounding, no number with a greater fractional part can be rounded down if there is a number with a smaller fractional part that had been rounded up.
We cannot round (1.4, 1.4, 1.2) to (1, 1, 2), since 1.4 would be rounded down while 1.2 (with smaller fractional part) is rounded up. However, we can round them to (1, 2, 1) or (1, 2, 1), since nothing prevents us from rounding numbers with equal fractional part differently.
Given this, we should build a query that would take a set of numbers and round them taking these two restrictions into account.
It is easily seen than to satisfy the second restriction a certain rounding threshold should exist: all numbers with the fractional part greater than this threshold are rounded up, those with the fractional part smaller than then threshold are rounded down, those with the fractional part equal to the threshold can be rounded both ways.
However, calculating this threshold is, first, quite a difficult task and, second, we still don’t know what to do with the numbers having the fractional part equal to the threshold.
Instead, we use another approach.
Let’s create a sample tables and show this approach:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.
Which method is best to select values present in one table but missing in another one?
This:
SELECT l.* FROM t_left l LEFT JOIN t_right r ON r.value = l.value WHERE r.value IS NULL, this:
SELECT l.* FROM t_left l WHERE l.value NOT IN ( SELECT value FROM t_right r )or this:
SELECT l.* FROM t_left l WHERE NOT EXISTS ( SELECT NULL FROM t_right r WHERE r.value = l.value )
Differences between the methods
These methods are quite different.
First of all, LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These method differ in how they handle NULL values in t_right
LEFT JOIN is guaranteed to return every row from t_left, and then filtering is applied to the values returned from t_right. If for some row in t_left there is no corresponding row in t_right (which means no row with that exact value is present in t_right), the row from t_left will be returned once, and the NULL values will be substituted instead of t_right‘s actual values.
Since NULL values can never satisfy an equality JOIN condition, the NULL values returned by the query are guaranteed to be substituted by the LEFT JOIN, not fetched out of the actual t_right‘s row. This means that LEFT JOIN / IS NULL is guaranteed to return at most one row from t_left, and these row’s value is not equal to one of those in t_right.
The same holds for NOT EXISTS. Since it’s a predicate, not a JOIN condition, the rows from t_left can only be returned at most once too. EXISTS always returns TRUE or FALSE and it will return TRUE as soon as it finds only a single matching row in t_right, or FALSE, if it find none.
NOT EXISTS, therefore, will return TRUE only if no row satisfying the equality condition is found in t_right (same as for LEFT JOIN / IS NULL).
Note that NULL values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL and NOT EXISTS will always return rows from t_left that have value set to NULL, even is there are rows with value IS NULL in t_right.
NOT IN, however, behaves differently.
IN predicate (unlike EXISTS) is trivalent, i. e. it can return TRUE, FALSE or NULL:
TRUEis returned when the non-NULLvalue in question is found in the listFALSEis returned when the non-NULLvalue is not found in the list and the list does not containNULLvaluesNULLis returned when the value isNULL, or the non-NULLvalue is not found in the list and the list contains at least oneNULLvalue
IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.
This of course makes no difference when using the positive form of NULL: predicates returning NULL are filtered out by the WHERE clause as well as those returning FALSE.
However, NOT IN is different, since negation of NULL is NULL as well.
That’s why NOT IN condition will never hold for any list with a NULL value in it.
- If a row is found in the list,
INwill returnTRUEandNOT IN, therefore, will returnFALSE - If a row is not found in the list,
INwill returnNULL, andNOT INon its turn will also returnNULL
Both conditions will of course be filtered out by the WHERE clause.
Let’s illustrate it with two simple queries that compare (1, NULL) in t_left with (2, NULL) in t_right:
WITH t_left AS
(
SELECT 1 AS value
UNION ALL
SELECT NULL
),
t_right AS
(
SELECT 2 AS value
UNION ALL
SELECT NULL
)
SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
)
| value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| NULL | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 rows fetched in 0.0001s (0.0006s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This query, using NOT EXISTS, returns both values from t_left, since neither of them is equal to any of the values from t_right.
WITH t_left AS
(
SELECT 1 AS value
UNION ALL
SELECT NULL
),
t_right AS
(
SELECT 2 AS value
UNION ALL
SELECT NULL
)
SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right
)
| value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 rows fetched in 0.0001s (0.0005s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This query, on the other hand, returns nothing. Since there is a NULL in t_right, NOT IN returns NULL rather than TRUE if the value is not found among the defined values. Just in case.
IN (and NOT IN) are too chicken to say something definite about lists with NULL unless they are completely sure that the value is there.
However, if the values in both tables are non-nullable, NULL, all three method describe above are semantically identical.
SQL Server 2005: finding foreign keys that have no index
From Stack Overflow:
How can I programmatically determine which foreign key are missing indexes?
Concise. Let’s try to answer it.
A foreign key involves two tables: a referencing one and a referenced one.
In most databases, some kind of a unique key is required on the referenced table (this may be a PRIMARY KEY or just a UNIQUE constraint). This is by design (when we delete a UNIQUE value, we always delete last instance of this value).
However, indexing the FOREIGN KEY fields in the referencing table is not required. Database can function even without this index. However, this can make deletes from the referenced table slow (to validate the relationship we should perform a full table lookup on the referenced table). It almost always makes sense to index the FOREIGN KEY fields.
For an index on the FOREIGN KEY fields to be of use, its leading key part should consist of all columns that comprise the FOREIGN KEY, in any order. Since FOREIGN KEY checks are always equijoins, not range joins, the order of columns does not matter as long as they are leading in the key.
Consider a constraint: FOREIGN KEY (col1, col2) REFERENCES othertable. Which indexes are good for this foreign key?
| Index | Is good? |
|---|---|
col1 |
No. Though it’s better than nothing, only the part of the foreign key can be used in the index range. |
col1, col2 |
Yes. It’s a classical case |
col2, col1 |
Yes. The order does not matter, since we are searching for exact values of both col1 and col2 |
col2, col1, col3 |
Yes. Despite an extra column, this index can be used to filter on col1 and col2 |
col3, col1, col2 |
No. Though both col1 and col2 are indexed, they are not in the leftmost part of the index |
So for each foreign key, we should check the following condition: there exists at least one index whose leading part consists of (and only of) all columns that constitute the key.
In SQL Server 2005, the information about indexed columns is stored in sys.index_column. Since indexes can be composite, this is a one-to-many relationship, where index_id and column_id define the index and the column indexed with it, while index_column_id defines column’s ordinal position in the index.
For an index to satisfy the foreign key, two conditions should be fulfilled:
- The number of columns in the index that are also a part of the foreign key should be equal to the maximal
index_column_id. This ensures that all columns constitute the leftmost part of the index. - The number of columns in the index that are also a part of the foreign key should be equal to the number of columns in the foreign key. This ensures that all foreign key columns are indexed.
And here’s the query:
SELECT fk.name, s.name, o.name,
FROM sys.foreign_keys fk
JOIN sys.objects o
ON o.object_id = fk.parent_object_id
JOIN sys.schemas s
ON s.schema_id = o.schema_id
WHERE EXISTS
(
SELECT *
FROM sys.index_columns ic
WHERE EXISTS
(
SELECT NULL
FROM sys.foreign_key_columns fkc
WHERE fkc.constraint_object_id = fk.object_id
AND fkc.parent_object_id = ic.object_id
AND fkc.parent_column_id = ic.column_id
)
GROUP BY
index_id
HAVING COUNT(*) = MAX(index_column_id)
AND COUNT(*) =
(
SELECT COUNT(*)
FROM sys.foreign_key_columns fkc
WHERE fkc.constraint_object_id = fk.object_id
)
)
Now, let’s create some sample tables and check this query:
SQL Server: leading wildcard match using an index
From Stack Overflow:
I’m joining to a table dozens of different times, and every time, I join (or filter) based on the results of a
SUBSTRINGof one of the columns (it’s a string, but left-padded with zeros, and I don’t care about the last four digits).As a result, even though this column is indexed and my query would use the index, it does a table scan because the
SUBSTRINGitself isn’t indexed, so SQL Server has to calculate it for every row before as it’s joining.I’m looking for any ideas on how to speed up this process.
This problem is called leading wildcard match. Though it’s possible to implement it, it usually requires indexing all possible combinations of substrings which usually bloats the index.
However, in this very case it’s possible to create quite an efficient index. It is possible because the number of characters we need to trim from the right is fixed.
The main idea is to create a computed column which would store trimmed and reversed value of the string, index that column and use this index in search.
Let’s create a sample table:
SQL Server: finding date closest to a birthdate
From Stack Overflow:
The problem is with dynamically getting upcoming what we call
namedaysin Europe.Name day is a tradition in many countries in Europe and Latin America of celebrating on a particular day of the year associated with the one’s given name.
Here’s a sampe name day table (I store the date in this format:
1900-08-22, but we really only need month and day):
First name Name day Bob 1900-04-22 Bob 1900-09-04 Frank 1900-01-02 The trick is that there might be multiple name days for each name, and someones name day is always the first one found after ones birthday.
So if Bob was born on August 5th, his name day would fall on September 4th, but if we were born after Sep 4th, his name day would be on April 22nd.
We need to find when a given person celebrates his/her name day based on the first name and birthdate.
What I need is an SQL Server query that will be able to get me name days for the people in my database.
We actually need to make the calendar to wrap around: to build a condition that would select 1904, Jan 1st after 1904, Dec 31th.
We can do it using a COALESCE on the simple queries in OUTER APPLY clause.
Let’s create sample tables:
SQL Server: efficient DISTINCT on dates
From Stack Overflow:
I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.
My problem isn’t getting the server to actually do this (I’ve seen this question already, and we had something similar already in place using
DISTINCT).The problem is whether there is any trick to get it done more quickly.
With the data we are using, our current query is returning ~80 distinct days for which there are ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query always manages to take 5+ seconds. Which is too slow.
The problem with this query is that no distinct list of truncated dates is easily accessible.
That’s why SQL Server has to fetch all the dates from the table (or from the index) and distinctify them somehow.
In Oracle we could probably create a function-based index that would index the truncated part of a DATETIME column. Then it would just use jump over the distinct index keys.
Same is possible in MySQL (however, the latter does not support function-based indexes, so one has to update the date-only column manually).
SQL Server does support indexes on computed columns, however, its optimizer lacks ability to skip distinct records in an index.
That’s why creating an index on a computed column doesn’t help.
But since we are talking about the dates, it’s possible to emulate this behavior to some extent using a recursive CTE that would generate a list of dates, and the checking each date.
Let’s create a sample table:
Efficient date range query: SQL Server
From Stack Overflow:
I have a table
Athat has astartDateand anendDateas 2DATETIMEcolumns besides some more other columns.I have another table
Bthat has oneDATETIMEcolumn (call itdatescolumn).This is in SQL Server 2005.
How to best set up the indexes etc to get the following:
SELECT * FROM a, b WHERE a.startDate >= b.dates AND a.endDate < b.datesBoth tables have several thousand records.
Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.
A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).
However, this query is just the opposite: we search for a constaint between two unknown values.
If we create the indexes on startDate and endDate and issue a query with a condition like this:
startDate < = value AND endDate > value
, this will just give two range condition on two distinct indexes:
-INF < = startDate <= value
value < endDate <= +INF
Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.
There is a special class of indexes: R-Tree indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called SPATIAL indexes.
In both these systems they can only be defined over geometry data which implies 2-dimensional objects (Points, LineStrings and such).
There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):
However, SQL Server 2005 does not support SPATIAL indexes.
In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).
However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I’ll show how to do it.
Let’s create a sample table:
Read the rest of this entry »
Fallback language names: SQL Server
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
From Stack Overflow:
I have table
itemand another tablelanguagewhich contains names for theitemsin different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCEon twoSELECTlist subqueries - Use
COALESCEon the results of twoLEFT JOINS - Use the combination of methods above: a
LEFT JOINfor French names and a subquery for English ones
Yesterday, we saw that the second option (two LEFT JOINs) is best in Oracle if the probability of the fallback is high (i. e. the localization is poor and but few terms are translated into the local language). I took Latin language as an example for this.
Third query (a LEFT JOIN with the local language rowset and a subquery in a COALESCE to retrive the fallback language names) is best if there is a low probability of a fallback, i. e. almost all terms are translated. In this case, I took French as an example, since it is widely used and localizations are likely to cover most terms.
Today we will see how same queries work in SQL Server.
Let’s create sample tables:
SQL Server: deleting adjacent values (improved)
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS
(
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY
(
SELECT TOP 1 qi.id AS mint
FROM [20090803_adjacent].records qi
WHERE qi.unitid = qo.unitid
AND qi.day = qo.day
AND qi.interval < = qo.interval
AND NOT EXISTS
(
SELECT NULL
FROM [20090803_adjacent].records t
WHERE t.unitid = qi.unitid
AND t.day = qi.day
AND t.interval = qi.interval - 1
)
ORDER BY
qi.interval DESC
)
ORDER BY interval
) AS rnm
FROM [20090803_adjacent].records qo
)
DELETE
FROM rows
WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
Its id then serves as a range identifier in a PARTITION BY clause.
However, this query can be improved.
Assume we have the following intervals:
| interval |
|---|
| 3 |
| 4 |
| 5 |
| 6 |
| 10 |
| 11 |
| 12 |
| 20 |
| 21 |
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:
| interval | ROW_NUMBER |
|---|---|
| 3 | 1 |
| 4 | 2 |
| 5 | 3 |
| 6 | 4 |
| 10 | 5 |
| 11 | 6 |
| 12 | 7 |
| 20 | 8 |
| 21 | 9 |
ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.
This means that we can just subtract ROW_NUMBER from the interval:
| interval | ROW_NUMBER | ROW_NUMBER - interval |
|---|---|---|
| 3 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 10 | 5 | 5 |
| 11 | 6 | 5 |
| 12 | 7 | 5 |
| 20 | 8 | 12 |
| 21 | 9 | 12 |
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check:
SQL Server: deleting adjacent values
From Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:
SELECT *
FROM mytable qi
WHERE EXISTS
(
SELECT NULL
FROM mytable t
WHERE t.unitid = qi.unitid
AND t.day = qi.day
AND t.interval = qi.interval - 1
)
However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.
If we just delete everything, the query above will not return any records just as well, but it’s not our goal.
The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.
So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).
In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.
However, we can formulate the delete condition.
We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.
The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).
The second record of any range should be deleted, since it’s adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).
The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.
In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.
The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.
It is best done by iterating the records up from the current and returning the id of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1). This can be done in a subquery right in the PARTITION BY clause of the ROW_NUMBER function.
Let’s create a sample table and compose the query:
Subscribe in a reader