Archive for 2009
Constraints and the optimizer in SQL Server: FOREIGN KEY
Comments enabled. I *really* need your comment
Continuing on from the yesterday's article.
I was wondering: do the constraints declared on tables in SQL Server affect the decisions made by the optimizer?
SQL Server allows to define the following constraints on the columns:
PRIMARY KEY
UNIQUE
FOREIGN KEY
CHECK
DEFAULT
Today we will see how FOREIGN KEY
affect the plans.
A FOREIGN KEY guarantees that every value of the column constrained with it is contained in a PRIMARY KEY or UNIQUE
column of another table.
This fact can be used to simpify joins on these columns and using IN
clause.
Let's create the sample tables:
Read the rest of this entry »
Constraints and the optimizer in SQL Server: PRIMARY KEY and UNIQUE
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Mitch asks:
I was wondering: do the constraints declared on tables in SQL Server affect the decisions made by the optimizer?
SQL Server allows to define the following constraints on the columns:
PRIMARY KEY
UNIQUE
FOREIGN KEY
CHECK
DEFAULT
Since these constraints imply some restrictions on data, it gives a theoretical possibility to use these restrictions while optimizing the queries so that some extra checks (which had already been made by the constraints) can be skipped.
SQL Server does use these possibilities and optimizes its plans taking the constraints into account.
In this article we will see how SQL Server uses PRIMARY KEY
and UNIQUE
to optimize the plans.
PRIMARY KEY and UNIQUE
Let's create a pair of sample tables:
Read the rest of this entry »
Running root mean square: SQL Server
Comments enabled. I *really* need your comment
From Stack Overflow:
Lets say I have the following in a SQL Server table:
Value 1 3 2 6 3 I need to get the differences of each of these numbers (in order), then square them, sum them, divide by number of values and take the square root.
This value is called root mean square. It is widely used in statistics.
However, we should calculate it for the differences between the values, not for the values themselves. In Oracle, the differences between rows could be calculated easilty using LAG
, an analytical function that returns a previous row. Unfortunately, SQL Server lacks this function.
It could be emulated using a subquery or CROSS APPLY
along with ORDER BY
and TOP
, but this makes the query less elegant and less efficient for large tables (since using TOP
in subqueries and CROSS APPLY
forces NESTED LOOPS
).
It is better to assign the row number to each row (using ROW_NUMBER
) and then just make an equijoin between two consecutive rows. This looks much better and the optimizer is free to choose any join method, including Hash Join
which is most efficient when large percent of the table rows participates in a join, or a Merge Join
which will be also efficient, given that ROW_NUMBER()
tends to output the ordered resultsets and SQL Server can utilize this fact.
Here's the query:
WITH nums AS ( SELECT value, ROW_NUMBER() OVER (ORDER BY orderer) AS rn FROM source ) SELECT SQRT(AVG(POWER(np.value - nn.value, 2))) FROM nums np JOIN nums nn ON nn.rn = np.rn + 1
And let's try it on the sample data:
WITH source (orderer, value) AS ( SELECT 10, 1 UNION ALL SELECT 20, 3 UNION ALL SELECT 30, 2 UNION ALL SELECT 40, 6 UNION ALL SELECT 50, 3 ), nums AS ( SELECT value, ROW_NUMBER() OVER (ORDER BY orderer) AS rn FROM source ) SELECT SQRT(AVG(POWER(np.value - nn.value, 2))) FROM nums np JOIN nums nn ON nn.rn = np.rn + 1
2.6457513110645907 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.0007s) |
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.
|--Compute Scalar(DEFINE:([Expr1027]=sqrt(CONVERT_IMPLICIT(float(53),[Expr1026],0)))) |--Compute Scalar(DEFINE:([Expr1026]=CASE WHEN [Expr1038]=(0) THEN NULL ELSE [Expr1039]/CONVERT_IMPLICIT(int,[Expr1038],0) END)) |--Stream Aggregate(DEFINE:([Expr1038]=COUNT_BIG(power([Union1011]-[Union1024],(2.000000000000000e+000))), [Expr1039]=SUM(power([Union1011]-[Union1024],(2.000000000000000e+000))))) |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([Expr1025])=([Expr1033]), RESIDUAL:([Expr1025]=([Expr1012]+(1)))) |--Sequence Project(DEFINE:([Expr1025]=row_number)) | |--Compute Scalar(DEFINE:([Expr1035]=(1))) | |--Segment | |--Sort(ORDER BY:([Union1023] ASC)) | |--Constant Scan(VALUES:(((10),(1)),((20),(3)),((30),(2)),((40),(6)),((50),(3)))) |--Compute Scalar(DEFINE:([Expr1033]=[Expr1012]+(1))) |--Sequence Project(DEFINE:([Expr1012]=row_number)) |--Compute Scalar(DEFINE:([Expr1037]=(1))) |--Segment |--Sort(ORDER BY:([Union1010] ASC)) |--Constant Scan(VALUES:(((10),(1)),((20),(3)),((30),(2)),((40),(6)),((50),(3))))
The query works, uses the Merge Join
and gives a correct result.
Minimum and maximum on bit fields: SQL Server
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query in SQL Server 2005:
SELECT * FROM tblJobs AS j -- joining other tables here WHERE Closed = 0 AND Invoiced = 0 AND Active = 1 AND DepartmentId <> 2 AND dbo.fncIsAllPointsDelivered(JobID) = 1This query is taking too long to run, and I know the problem is the UDF:
dbo.fncIsAllPointsDelivered(J.JobID) = 1
The SQL for the UDF is here:
DECLARE @DetailCount int DECLARE @TrackingCount int SELECT @DetailCount = COUNT(*) FROM tblLoadDetails WHERE JobId = @JobId SELECT @TrackingCount = COUNT(*) FROM tblLoadDetails WHERE JobId = @JobId AND Delivered = 1 IF (@DetailCount = @TrackingCount AND @DetailCount > 0) RETURN 1 RETURN 0All of this runs blazingly fast unless the job has a large number of load details in it.
I am trying to think of a way to either make the UDF faster or get rid of the need for the UDF, but I am at a loss.
UDF is definitely a problem here, since the conditions based on the UDF's are not sargable, that is an index cannot be used to search for them.
Let's look closed into this UDF.
Given a @JobID
, the function searches the load details for this job. For the function to succeed, all load details should have the state of Delivered
, and there must be at least one load detail for this job.
The UDF can be rewritten using IN
and NOT IN
conditions. The first condition makes sure that there is at least one load detail for the given job, the second one ensures that all load details have state Delivered
.
Let's create a sample table and see how it works:
Read the rest of this entry »
Oracle: nested SUM
From Stack Overflow:
Suppose I have this table
A B C D Datatypes are not important.
I want to do this:
SELECT a AS a1, b AS b1, c AS c1, ( SELECT SUM(d) FROM source WHERE a = a1 AND b = b1 ) AS total FROM source GROUP BY a, b, c, but I can't find a way (SQL Developer keeps complaining with
.)FROM
clause not foundIs there a way? Is it possible?
This is of course possible if we just alias the query and prepend the alias to the field:
SELECT a, b, c, ( SELECT SUM(d) FROM source si WHERE si.a = so.a AND si.b = so.b CONNECT BY 16 >= level ) FROM source so GROUP BY a, b, c
This works well on this sample set of data:
WITH source AS ( SELECT FLOOR(MOD(level - 1, 8) / 4) + 1 AS a, FLOOR(MOD(level - 1, 4) / 2) + 1 AS b, MOD(level - 1, 2) + 1 AS c, level AS d FROM dual CONNECT BY 16 >= level ) SELECT a, b, c, ( SELECT SUM(d) FROM source si WHERE si.a = so.a AND si.b = so.b ) FROM source so GROUP BY a, b, c
, but it needs to reevaluate the subquery for each group.
In Oracle, there is a better way to do this query: nesting an aggregate SUM
inside an analytical SUM
.
Read the rest of this entry »
IN list vs. range condition: MySQL
Answering questions asked on the site.
Princess asks:
Hello, I've got a problem with SQL.
I have a table which contains the production details for the factories. They are not factories and items of course but I cannot disclose the project and need to obfuscate so let's pretend they are :)
I need to select the items for which the first 5 factories have low production rate.
I tried to do the query like this:
SELECT ProductionItem FROM FactoryProductions WHERE 5 >= FactoryID AND 100 >= ProductionAmountwhich returns correct results but is slow.
I have an index on
(FactoryID, ProductionAmount)
.There are 13 Factories and 2,300,000 Items
This is in MySQL
This is a nice illustration of how index range evaluation works in MySQL and how to optimize it.
We will assume that the ProductionAmount
is usually much higher than 100
Now, let's create a sample table:
Read the rest of this entry »
Shuffling rows: PostgreSQL
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Josh asks:
I am building a music application and need to create a playlist of arbitrary length from the tracks stored in the database.
This playlist should be shuffled and a track can repeat only after at least 10 other tracks had been played.
Is it possible to do this with a single SQL query or I need to create a cursor?
This is in PostgreSQL 8.4
PostgreSQL 8.4 is a wise choice, since it introduces some new features that ease this task.
To do this we just need to keep a running set that would hold the previous 10 tracks so that we could filter on them.
PostgreSQL 8.4 supports recursive CTE's that allow iterating the resultsets, and arrays that can be easily used to keep the set of 10 latest tracks.
Here's what we should do to build the playlist:
- We make a recursive CTE that would generate as many records as we need and just use LIMIT to limit the number
- The base part of the CTE is just a random record (fetched with
ORDER BY RANDOM() LIMIT 1
) - The base part also defines the queue. This is an array which holds 10 latest records selected. It is initialized in the base part with the
id
of the random track just selected - The recursive part of the CTE joins the previous record with the table, making sure that no record from the latest 10 will be selected on this step. To do this, we just use the array operator
<@
(contained by) - The recursive part adds newly selected record to the queue. The queue should be no more than 10 records long, that's why we apply array slicing operator to it (
[1:10]
)
Let's create a sample table:
Read the rest of this entry »
Date range with possible NULL values
From Stack Overflow:
Using SQL Server 2008.
I have a stored proc which has start and end date as input parameters for date range.
Looking for a single SQL query which has a
BETWEEN
predicate on start and end date in theWHERE
clause which can handle both cases where the dates are either bothNULL
or both have values.
This can be done by writing an IS NULL
check in the expression, like this:
WHERE (@startDate IS NULL OR date >= @startDate) AND (@endDate IS NULL OR date >= @endDate)
or by using a COALESCE
on the value of the date itself (so that the range always matches):
WHERE date BETWEEN ISNULL(@startDate, date) AND ISNULL(@endDate, date)
However, none of these conditions uses an index on date
efficiently.
Let's create a sample table ans see:
Read the rest of this entry »
Bitwise operations and indexes
From Stack Overflow:
Are the following queries efficient in MySQL:
SELECT * FROM table WHERE field & number = number -- to find values with superset of number's bitsSELECT * FROM table WHERE field | number = number -- to find values with subset of number's bits, if an index for the
field
has been created?If not, is there a way to make it run faster?
An index can be used for the following things:
- To limit the number of records scanned
- To lower the number of row lookups
When doing a full table scan, every record should be fetched and examined. If the table contains say, 1,000,000 records, and each record is 100 bytes long, then 100 Mb worth of data should be processed. These data are usually scanned sequentially.
An index row, on the other hand, contains the key value, the row pointer and some additional information (pointers to the parents and children). On a MyISAM table, each row of the index on an INT
column occupies 10 bytes (4 (sizeof(INT)
) + 6 (default MyISAM pointer size)) plus some overhead (block headers, leaf pointers etc).
Even if we cannot build a range on the index and need to look over all index values, 10 Mb is far less than 100 Mb.
However, scanning the index has two drawbacks:
- Traversing the B-Tree is more slow than a full table scan, since the former is not sequential
- When the condition we search for matches the index key, we should follow the pointer to fetch the unindexed values from the table row. This also takes some time.
Both these things can be optimized:
- Oracle has
INDEX FAST FULL SCAN
access method. It scans the index rows in their physical order, not logical. The key order is not maintained in such a scan, but we not always need it anyway.This makes the index scanning sequential (and taking the constant time)
- PostgreSQL has bitmap access method. PostgreSQL cannot disregard the index order as Oracle does so it has to traverse the tree anyway. But instead of following the pointer to fetch the values it fills a special bitmap with a bit set for each matching row.
The order of bits corresponds to the physical order of rows.
When it's over with setting bits it just reads the bitmap and fetches the rows with the bits set. Since the bits are ordered, the rows fetched are ordered too, so this scan is also sequential. Scanning the bitmap takes constant time (which is usually negligible since bitmaps are very small), pure time for fetching the rows depends on the number of bits set but is no longer than a full scan (since the scan is sequential).
MySQL, however, does neither of these optimizations.
IN vs. JOIN vs. EXISTS: Oracle
Answering questions asked on the site.
Jason asks:
I have a co-worker who swears that Oracle
IN
queries are slow and refuses to use them. For example:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )Typically, this kind of performance advice strikes me as overgeneralized and my instinct is to ignore it. But I figure I'll give him the benefit of the doubt and ask about it.
So, in general is an
IN
query very expensive? I'm having trouble putting together many non-trivial queries to run onEXPLAIN PLAN
.
Does the IN
predicate always have inferior efficiency compared to it's counterparts, EXISTS
and JOIN
?
Let's check.
First of all, there are at least three ways to check each row in a table against a list of (possible duplicate) values and return each row at most once:
IN
predicate:SELECT foo FROM bar WHERE bar.stuff IN ( SELECT stuff FROM asdf )
EXISTS
predicate:SELECT foo FROM bar WHERE EXISTS ( SELECT NULL FROM asdf WHERE asdf.stuff = bar.stuff )
JOIN
/DISTINCT
:SELECT b.foo FROM ( SELECT DISTINCT stuff FROM asdf ) a JOIN bar b ON b.stuff = a.stuff
All these queries are semantically the same.
Common wisdom advices against using IN
predicates because it just doesn't look a right thing to do.
This is because less experienced SQL developers tend to translate the SQL statements into pseudocode just like they see it, which in this case looks something like this:
foreach ($bar as $bar_record) { foreach ($asdf as $asdf_record) { if ($bar_record->stuff == $asdf_record->stuff) output ($bar_record); break; } }
This just radiates inefficiency, since the inner rowset should be iterated for each row returned.
EXISTS
looks more nice since it at least gives a hint of possibility to use an index. However, it still looks like the same nested loops.
Finally, the JOIN
looks the most promising here since everybody knows joins are optimized (though few know how exactly). But an IN
? No thanks, they say, it will execute once for every row, it's slow, it's bad, it's inefficient!
People, come on. Oracle developers have thought it over ages ago and guess what: they turned out to be smart enough to implement an efficient algorithm for an IN
construct.
Now, let's make two sample tables and see how it's done:
Read the rest of this entry »