Archive for the ‘SQL Server’ Category
Efficient circle distance testing
Comments enabled. I *really* need your comment
Answering questions asked on the site.
eptil asks:
I am using SQL Server 2008, but not the spatial features.
I have a table with few entries, only 40,000. There is an
id INT PRIMARY KEY
column and two columns storing a 2d coordinate, both decimals.I would like to find all the records that do not have other records within a given radius. The query I am using at the moment is:
SELECT id, x, y
FROM mytable t1
WHERE (
SELECT COUNT(*)
FROM mytable t2
WHERE ABS(t1.x - t2.x) < 25 AND ABS(t1.y - t2.y) < 25 ) = 1 [/sourcecode]
This is taking 15 minutes to run at times.Is there a better way?
Of course using spatial abilities would be a better way, but it is possible to make do with plain SQL. This will also work in SQL Server 2005.
In most database engines, the spatial indexes are implemented as the R-Tree structures. SQL Server, however, uses another approach: surface tesselation.
Basically, it divides the surface into a finite number of tiles, each assigned with a unique number. The identifiers of tiles covered by the object are stored as keys of a plain B-Tree index.
When SQL Server's optimizer sees a geometrical predicate against an indexed column, it calculates the numbers of tiles that possibly can satisfy this predicate. Say, if the tiles are defined as squares with side 1, the predicate column.STDistance(@mypoint) < 2
can only be satisfied by the objects within 2 tiles away from @mypoint
's tile. This gives a square of 25 tiles with @mypoint
's tile in the center. The tile numbers can be found and searched for using the index. Exact filtering condition is then applied to each candidate value returned by the index.
Same solution can be used in our case even without the spatial functions. Comparing tile numbers is an equijoin and hash join method is eligible for this operation. We can even choose the tiling algorithm individually for each query, since we don't have to store the tile identifiers in the table, and the hash table will be built dynamically anyway.
Let's create a sample table and see how it works:
Read the rest of this entry »
SQL Server: EXCEPT ALL
Answering questions asked on the site.
myst asks:
I have two really large tables with lots of columns, many of them are nullable
I need to remove all rows from
table1
which are not present intable2
, but there can be duplicates in both tables and writing more than 70IS NULL
conditions would be a pain, and I want to make sure there's nothing I'm missing.Is there a more simple way?
SQL Server supports EXCEPT
clause which returns all records present in the first table and absent in the second one. But this clause eliminates duplicates and cannot be used as a subject to a DML
operation.
ANSI SQL standard describes EXCEPT ALL
which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, SQL Server does not support this operator.
Similar behavior can be achieved using NOT IN
or NOT EXISTS
constructs. But in SQL Server, IN
predicate does not accept more than one field. NOT EXISTS
accepts any number of correlated columns, but it requires extra checks in the WHERE
clause, since equality operator does not treat two NULL values as equal. Each pair or nullable columns should be additionally checked for a NULL in both fields. This can only be done using OR
predicates or COALESCE
, neither of which adds to performance.
But there is a way to emulate EXCEPT ALL
in SQL Server quite elegantly and efficiently.
Let's create two sample tables:
Read the rest of this entry »
SQL Server: running totals
From Stack Overflow:
We have a table of transactions which is structured like the following :
transactions TranxID ItemID TranxDate TranxAmt
TranxAmt
can be positive or negative, so the running total of this field (for anyItemID
) will go up and down as time goes by.Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the
TranxDate
when this occurred.Note that TranxDate is not unique.
SQL Server is a very nice system, but using it for calculating running totals is a pain.
Oracle supports additional clauses for analytic functions, RANGE
and ROWS
, which define the boundaries of the function's windows and hence can be used to implement running totals. By default, it is just enough to omit the RANGE
clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.
SQL Server's support for window functions only extends aggregate capabilities a little so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM
and COUNT
it is impossible to control the window boundaries and the records order. Such analytic functions can not be used to calculate running totals.
The common way to write such a running total query is using a subquery or a self join which would count the SUM
of all previous records. However, the complexity of this query is O(n^2)
and it's not usable for any real volumes of data.
This is one of the few cases when the cursors are faster than a set-based solution described above. But we all are aware of the drawbacks of cursors and better search for something else.
This task, fortunately, is a little more simple than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.
Let's create a sample table:
Read the rest of this entry »
Calculating mode
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query:
WITH CTE AS ( SELECT e_id, scale, ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER(PARTITION BY e_id) AS cn FROM ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, AVG(scale) AS [AVG], STDEV(scale) AS [StdDev], AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian] MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian] FROM CTE GROUP BY e_idHow do I add Mode to this query?
A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.
In other words, for each e_id
, mode is the (exact) value of scale
shared by most records with this e_id
.
Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1
and 10 other records have scale = 2
(and all other values of scale
are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.
This means that we should define a way on how to choose this mode.
There can be three approaches to this:
- Return every modal value
- Return a single modal value
- Return an aggregate of all modal values
To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »
Grouping continuous ranges
From Stack Overflow:
I have a ticketing system.
Now I have to select adjacent places when the user asks for 2 or 3 tickets.
Every ticket has a line and column number. The concept of adjacent places is places in the same line with adjacent columns numbers.
These tickets are in an SQL Server database.
Any ideas about this algorithm to search for available adjacent seats?
This is a problem known as grouping continuous ranges
: finding the continuous ranges of the records (in a certain order) having the same value of the grouping column.
For the table in question, the groups will look like this:
Row | Column | Occupied | Group |
---|---|---|---|
1 | 1 | 1 | - |
1 | 2 | 0 | 1 |
1 | 3 | 0 | 1 |
1 | 4 | 0 | 1 |
1 | 5 | 1 | - |
1 | 6 | 1 | - |
1 | 7 | 0 | 2 |
1 | 8 | 0 | 2 |
1 | 9 | 1 | - |
1 | 10 | 0 | 3 |
To find the spans of free seats with the required length, we need to group the continuous ranges of rows with occupied = 0
in column
order (within the rows), calculate the counts of records in these groups and return the groups having sufficient value of COUNT(*)
.
To group something in SQL, we need to have an expression that would have the same value for all records belonging to the group.
How do we build such an expression for the continuous ranges?
In a movie theater, seats are numbered sequentially, so we can rely on the values of the column
being continuous. Now, let's calculate only the records that describe the free seats, and build their row numbers in the column
order:
Row | Column | Occupied | ROW_NUMBER |
---|---|---|---|
1 | 2 | 0 | 1 |
1 | 3 | 0 | 2 |
1 | 4 | 0 | 3 |
1 | 7 | 0 | 4 |
1 | 8 | 0 | 5 |
1 | 10 | 0 | 6 |
We see that filtering only free seats broke the column numbering order (the columns numbers are not continuous anymore), but the ROW_NUMBER
s are continuous (by definition). Each occupied seat breaks the continuity of the columns, and the spans of the free seats correspond to the ranges of columns with unbroken continuity.
If the continuity is not broken, column
and ROW_NUMBER
both increase by 1 with each record. This means that their difference will be constant:
Row | Column | Occupied | ROW_NUMBER | column - ROW_NUMBER |
---|---|---|---|---|
1 | 2 | 0 | 1 | 1 |
1 | 3 | 0 | 2 | 1 |
1 | 4 | 0 | 3 | 1 |
1 | 7 | 0 | 4 | 3 |
1 | 8 | 0 | 5 | 3 |
1 | 10 | 0 | 6 | 4 |
The value of difference between column
and ROW_NUMBER
, therefore, uniquely defines the continuous group the records belong to.
This means we can group on it and do other things (the rest being pure technical).
Let's create a sample table:
Read the rest of this entry »
SQL Server: Selecting records holding group-wise maximum (with ties)
Comments enabled. I *really* need your comment
A feedback on the yesterday's article on selecting records holding group-wise maximums in SQL Server.
Gonçalo asks:
Regarding the post I mention on the subject, wouldn't it be easier to obtain the result you're after using the SQL Server specific
SELECT TOP x WITH TIES
?
This clause (which is indeed specific to SQL Server) returns ties. TOP x WITH TIES
means return TOP x
records, plus all records holding the same values of the ORDER BY
expressions as the last (x
'th) record returned by the TOP
.
With a little effort this can be used for one of the queries I wrote about yesterday, namely the one that returns all records holding MIN(orderer)
within each group.
However, the efficiency of this solution varies greatly depending on the cardinality of both ordering and grouping columns.
Let's create a sample table:
Read the rest of this entry »
SQL Server: Selecting records holding group-wise maximum
Continuing the series on selecting records holding group-wise maximums:
How do I select the whole records, grouped on
grouper
and holding a group-wise maximum (or minimum) on other column?
In this article, I'll consider SQL Server.
SQL Server has very rich SQL syntax and its optimizer is really powerful.
However, some tricks are still required to make the queries like this to run faster.
Let's create a sample table:
Read the rest of this entry »
SQL Server: are the recursive CTE’s really set-based?
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Karl asks:
I've read your article about adjacency lists vs. nested sets in Oracle. A nice explanation, thanks.
You write that recursive operations are row-based in Oracle and set-based in SQL Server.
Which approach is better in your opinion?
First, a quick reminder for those who have no idea what's this all about.
Recursive queries are useful to build the hierarchy trees, traverse graphs, generate arbitrary rowsets etc. Usually this involves joining a set with itself an arbitrary number of times.
A recursive query is usually defined by the anchor part (the initial rowset, or the base of recursion) and the recursive part (some operation over the previous resultset).
This is same as with the plain recursion: take some initial data as a parameter, modify it in some way and then pass the result of the modification as a new parameter to the same function.
Since SQL operates with sets, the input and the output if this function should also be the sets. And here is the main differences between the approaches used by Oracle on one side, and SQL Server, PostgreSQL and some more systems on the other side.
Oracle uses CONNECT BY
approach, while SQL Server uses recursive CTE approach.
Read the rest of this entry »
SQL Server: clustered index and ordering
From Stack Overflow:
My table (SQL Server 2008) has more than 1 million records.
When I try to order records by
datetime
, it takes 1 second, but when I order byid
(integer), it only takes about 0.1 second.Is there any way to improve the efficiency?
The obvious answer would be create an index on
datetime
However, everything's not so simple.
Let's create a sample table and see why:
Read the rest of this entry »
Inverting date ranges
Comments enabled. I *really* need your comment
From Stack Overflow:
In an SQL table I keep bookings for various resouces, with a
StartDate
/EndDate
column:
ResourceID StartDate EndDate 1 2009-01-01 2009-01-05 1 2009-01-07 2009-01-10 2 2009-01-03 2009-01-18 A record means that the given resource is busy.
I need to produce a list of all resources that are available for at least X consecutive days in a given time interval, e. g. from 2009-01-01 to 2009-01-20 and 5 consecutive days.
This would be quite an easy task if the table records reflected available resources. We could just check that a resource exists using EXISTS
and that it has given length using DATEDIFF
.
However, the table reflects the busy resources. This means we should invert the date ranges: given a list of consecutive non-overlapping date ranges, we should build a list of their complements, or the gaps between them.
Here's how we do it.
First, we should build an ordered set of the date ranges. Since the ranges are not overlapping, the set can be ordered by any column (StartDate
or EndDate
), with the result being the same: