How to create fast database queries

Archive for the ‘SQL Server’ Category

Efficient circle distance testing

Comments enabled. I *really* need your comment

Answering questions asked on the site.

eptil asks:

I am using SQL Server 2008, but not the spatial features.

I have a table with few entries, only 40,000. There is an id INT PRIMARY KEY column and two columns storing a 2d coordinate, both decimals.

I would like to find all the records that do not have other records within a given radius. The query I am using at the moment is:

SELECT id, x, y
FROM mytable t1
FROM mytable t2
WHERE ABS(t1.x - t2.x) < 25 AND ABS(t1.y - t2.y) < 25 ) = 1 [/sourcecode]
This is taking 15 minutes to run at times.

Is there a better way?

Of course using spatial abilities would be a better way, but it is possible to make do with plain SQL. This will also work in SQL Server 2005.

In most database engines, the spatial indexes are implemented as the R-Tree structures. SQL Server, however, uses another approach: surface tesselation.

Basically, it divides the surface into a finite number of tiles, each assigned with a unique number. The identifiers of tiles covered by the object are stored as keys of a plain B-Tree index.

When SQL Server's optimizer sees a geometrical predicate against an indexed column, it calculates the numbers of tiles that possibly can satisfy this predicate. Say, if the tiles are defined as squares with side 1, the predicate column.STDistance(@mypoint) < 2 can only be satisfied by the objects within 2 tiles away from @mypoint's tile. This gives a square of 25 tiles with @mypoint's tile in the center. The tile numbers can be found and searched for using the index. Exact filtering condition is then applied to each candidate value returned by the index.

Same solution can be used in our case even without the spatial functions. Comparing tile numbers is an equijoin and hash join method is eligible for this operation. We can even choose the tiling algorithm individually for each query, since we don't have to store the tile identifiers in the table, and the hash table will be built dynamically anyway.

Let's create a sample table and see how it works:
Read the rest of this entry »

Written by Quassnoi

February 26th, 2010 at 11:00 pm

Posted in SQL Server


with 2 comments

Answering questions asked on the site.

myst asks:

I have two really large tables with lots of columns, many of them are nullable

I need to remove all rows from table1 which are not present in table2, but there can be duplicates in both tables and writing more than 70 IS NULL conditions would be a pain, and I want to make sure there's nothing I'm missing.

Is there a more simple way?

SQL Server supports EXCEPT clause which returns all records present in the first table and absent in the second one. But this clause eliminates duplicates and cannot be used as a subject to a DML operation.

ANSI SQL standard describes EXCEPT ALL which returns all records from the first table which are not present in the second table, leaving the duplicates as is. Unfortunately, SQL Server does not support this operator.

Similar behavior can be achieved using NOT IN or NOT EXISTS constructs. But in SQL Server, IN predicate does not accept more than one field. NOT EXISTS accepts any number of correlated columns, but it requires extra checks in the WHERE clause, since equality operator does not treat two NULL values as equal. Each pair or nullable columns should be additionally checked for a NULL in both fields. This can only be done using OR predicates or COALESCE, neither of which adds to performance.

But there is a way to emulate EXCEPT ALL in SQL Server quite elegantly and efficiently.

Let's create two sample tables:
Read the rest of this entry »

Written by Quassnoi

February 10th, 2010 at 11:00 pm

Posted in SQL Server

SQL Server: running totals

with one comment

From Stack Overflow:

We have a table of transactions which is structured like the following :

TranxID ItemID TranxDate TranxAmt

TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by.

Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the TranxDate when this occurred.

Note that TranxDate is not unique.

SQL Server is a very nice system, but using it for calculating running totals is a pain.

Oracle supports additional clauses for analytic functions, RANGE and ROWS, which define the boundaries of the function's windows and hence can be used to implement running totals. By default, it is just enough to omit the RANGE clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.

SQL Server's support for window functions only extends aggregate capabilities a little so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM and COUNT it is impossible to control the window boundaries and the records order. Such analytic functions can not be used to calculate running totals.

The common way to write such a running total query is using a subquery or a self join which would count the SUM of all previous records. However, the complexity of this query is O(n^2) and it's not usable for any real volumes of data.

This is one of the few cases when the cursors are faster than a set-based solution described above. But we all are aware of the drawbacks of cursors and better search for something else.

This task, fortunately, is a little more simple than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

January 22nd, 2010 at 11:00 pm

Posted in SQL Server

Calculating mode

Comments enabled. I *really* need your comment

From Stack Overflow:

I have this query:

        SELECT  e_id, 
                ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER(PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian]
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]

How do I add Mode to this query?

A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.

In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.

Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.

This means that we should define a way on how to choose this mode.

There can be three approaches to this:

  1. Return every modal value
  2. Return a single modal value
  3. Return an aggregate of all modal values

To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »

Written by Quassnoi

January 18th, 2010 at 11:00 pm

Posted in SQL Server

Grouping continuous ranges

with one comment

From Stack Overflow:

I have a ticketing system.

Now I have to select adjacent places when the user asks for 2 or 3 tickets.

Every ticket has a line and column number. The concept of adjacent places is places in the same line with adjacent columns numbers.

These tickets are in an SQL Server database.

Any ideas about this algorithm to search for available adjacent seats?

This is a problem known as grouping continuous ranges: finding the continuous ranges of the records (in a certain order) having the same value of the grouping column.

For the table in question, the groups will look like this:

Row Column Occupied Group
1 1 1 -
1 2 0 1
1 3 0 1
1 4 0 1
1 5 1 -
1 6 1 -
1 7 0 2
1 8 0 2
1 9 1 -
1 10 0 3

To find the spans of free seats with the required length, we need to group the continuous ranges of rows with occupied = 0 in column order (within the rows), calculate the counts of records in these groups and return the groups having sufficient value of COUNT(*).

To group something in SQL, we need to have an expression that would have the same value for all records belonging to the group.

How do we build such an expression for the continuous ranges?

In a movie theater, seats are numbered sequentially, so we can rely on the values of the column being continuous. Now, let's calculate only the records that describe the free seats, and build their row numbers in the column order:

Row Column Occupied ROW_NUMBER
1 2 0 1
1 3 0 2
1 4 0 3
1 7 0 4
1 8 0 5
1 10 0 6

We see that filtering only free seats broke the column numbering order (the columns numbers are not continuous anymore), but the ROW_NUMBERs are continuous (by definition). Each occupied seat breaks the continuity of the columns, and the spans of the free seats correspond to the ranges of columns with unbroken continuity.

If the continuity is not broken, column and ROW_NUMBER both increase by 1 with each record. This means that their difference will be constant:

Row Column Occupied ROW_NUMBER column - ROW_NUMBER
1 2 0 1 1
1 3 0 2 1
1 4 0 3 1
1 7 0 4 3
1 8 0 5 3
1 10 0 6 4

The value of difference between column and ROW_NUMBER, therefore, uniquely defines the continuous group the records belong to.

This means we can group on it and do other things (the rest being pure technical).

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 30th, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: Selecting records holding group-wise maximum (with ties)

Comments enabled. I *really* need your comment

A feedback on the yesterday's article on selecting records holding group-wise maximums in SQL Server.

Gonçalo asks:

Regarding the post I mention on the subject, wouldn't it be easier to obtain the result you're after using the SQL Server specific SELECT TOP x WITH TIES?

This clause (which is indeed specific to SQL Server) returns ties. TOP x WITH TIES means return TOP x records, plus all records holding the same values of the ORDER BY expressions as the last (x'th) record returned by the TOP.

With a little effort this can be used for one of the queries I wrote about yesterday, namely the one that returns all records holding MIN(orderer) within each group.

However, the efficiency of this solution varies greatly depending on the cardinality of both ordering and grouping columns.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 1st, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: Selecting records holding group-wise maximum

with one comment

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll consider SQL Server.

SQL Server has very rich SQL syntax and its optimizer is really powerful.

However, some tricks are still required to make the queries like this to run faster.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 30th, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: are the recursive CTE’s really set-based?

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Karl asks:

I've read your article about adjacency lists vs. nested sets in Oracle. A nice explanation, thanks.

You write that recursive operations are row-based in Oracle and set-based in SQL Server.

Which approach is better in your opinion?

First, a quick reminder for those who have no idea what's this all about.

Recursive queries are useful to build the hierarchy trees, traverse graphs, generate arbitrary rowsets etc. Usually this involves joining a set with itself an arbitrary number of times.

A recursive query is usually defined by the anchor part (the initial rowset, or the base of recursion) and the recursive part (some operation over the previous resultset).

This is same as with the plain recursion: take some initial data as a parameter, modify it in some way and then pass the result of the modification as a new parameter to the same function.

Since SQL operates with sets, the input and the output if this function should also be the sets. And here is the main differences between the approaches used by Oracle on one side, and SQL Server, PostgreSQL and some more systems on the other side.

Oracle uses CONNECT BY approach, while SQL Server uses recursive CTE approach.
Read the rest of this entry »

Written by Quassnoi

November 18th, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: clustered index and ordering

with 2 comments

From Stack Overflow:

My table (SQL Server 2008) has more than 1 million records.

When I try to order records by datetime, it takes 1 second, but when I order by id (integer), it only takes about 0.1 second.

Is there any way to improve the efficiency?

The obvious answer would be create an index on datetime

However, everything's not so simple.

Let's create a sample table and see why:
Read the rest of this entry »

Written by Quassnoi

November 11th, 2009 at 11:00 pm

Posted in SQL Server

Inverting date ranges

Comments enabled. I *really* need your comment

From Stack Overflow:

In an SQL table I keep bookings for various resouces, with a StartDate / EndDate column:

ResourceID StartDate EndDate
1 2009-01-01 2009-01-05
1 2009-01-07 2009-01-10
2 2009-01-03 2009-01-18

A record means that the given resource is busy.

I need to produce a list of all resources that are available for at least X consecutive days in a given time interval, e. g. from 2009-01-01 to 2009-01-20 and 5 consecutive days.

This would be quite an easy task if the table records reflected available resources. We could just check that a resource exists using EXISTS and that it has given length using DATEDIFF.

However, the table reflects the busy resources. This means we should invert the date ranges: given a list of consecutive non-overlapping date ranges, we should build a list of their complements, or the gaps between them.

Here's how we do it.

First, we should build an ordered set of the date ranges. Since the ranges are not overlapping, the set can be ordered by any column (StartDate or EndDate), with the result being the same:

Read the rest of this entry »

Written by Quassnoi

November 9th, 2009 at 11:00 pm

Posted in SQL Server