Author Archive
Constant vs. NULL to mark missing values in OUTER JOINs
Comments enabled. I *really* need your comment
From Stack Overflow (shortened for clarity):
I have two tables:
homes
andhome_photos
:It's very possible that a home does not have a photo associated with the home.
In that case, the column
primary_photo_group_id
will have value 0.Otherwise,
primary_photo_group_id
equals thegroup_id
of the photo.How do I return a
NULL
if the photo if not defined for a home?
This is of course a classic task for an OUTER JOIN
:
SELECT * FROM homes h LEFT JOIN photos p ON p.home_photo_group_id = h.primary_photo_group_id AND …
This construct, however, can be optimized, and I'll show how.
Let's create and fill the sample tables:
Read the rest of this entry »
IN with a comma separated list: SQL Server
From Stack Overflow:
How do I search a table for all values from a comma separated list, or return all values if the list contains a single 0?
Using comma separated lists in SQL Server as a search parameter is generally considered a bad design and should be avoided.
One should pass lists in a table variable rather than in a set of comma separated values.
However, there are legacy applications, ORM's and these kinds of stuff that just leave you no choice.
MySQL provides a nice function find_in_set
designed exactly for this task.
SQL Server does not directly support an IN
predicate for CSV
, neither it provides find_in_set
. It needs a rowset for IN
predicate to work.
To expand a comma separated list into a rowset, we will use a recursive CTE
.
In the CTE
, we should search for the position of every next comma and return it.
Then we can just take the substring between two commas and cast it into an INT
:
DECLARE @lid VARCHAR(100) SET @lid = '3, 14, 15, 296, 5358'; WITH cd AS ( SELECT 1 AS first, CHARINDEX(',', @lid, 1) AS next UNION ALL SELECT next + 1, CHARINDEX(',', @lid, next + 1) FROM cd WHERE next > 0 ), lid AS ( SELECT CAST(SUBSTRING(@lid, first, CASE next WHEN 0 THEN LEN(@lid) + 1 ELSE next END - first)AS INT) AS id FROM cd ) SELECT * FROM lid
id |
---|
3 |
14 |
15 |
296 |
5358 |
Now we just need to return the values from the table found in this list, or all values if there is a single 0 in the list.
Let's create a sample table and build a query to do this:
Read the rest of this entry »
Indexing tiny tables
From Stack Overflow:
Hypothetically, in a SQL Server database, if I have a table with two
INT
fields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the twoINT
fields overcomes the overhead imposed by said indexes?
Very nice question.
Imagine we have a table table1 (col INT NOT NULL, value1 INT NOT NULL, value2 VARCHAR(100) NOT NULL, …)
with an index on col
.
In SQL Server, an index can improve the following things.
-
Index lookups (range scans). An index can be used to limit the number of rows scanned:
SELECT * FROM table1 t1 WHERE t1.col BETWEEN @start AND @end
- Secondary data source. This query will get all data it need from the index:
SELECT t1.col FROM table1 t1
- Ordered rowsets. This query will not need to sort:
SELECT * FROM table1 t1 ORDER BY t1.col
- Efficient
MIN
/MAX
:SELECT MIN(col) FROM table1 t1
The latter case is probably the best illustration of the fact that even a table of 2 rows can benefit from creating an index.
Let's create the sample tables:
Read the rest of this entry »
Click
Back in the end of 90's I had been employed with a small startup which was founded to develop a restaurant management system.
The job listing they posted said PostgreSQL wanted
(sic). And PostgreSQL for sure was something I knew about. You see, I already had experience with PostgreSQL by that time. I was able to install PostgreSQL, create a database, run a query against it and even create a trigger in C (I didn't yet know of pl/pgSQL).
So I applied for the job, got hired and started coding.
Coding what you may ask?
Well, I had no idea. And in fact, nobody did. There were only two requirements for the system.
First, it had to work somehow with a Casio Cassiopeia, which allowed a wireless network module to be installed and used. A waiter had to be able to take your order and it had to get to the kitchen instantly.
Second one was personal. The founders of the company (and me too) used to dine in a small cafe at the basement of the office building. They had a menu, but it didn't correlate with reality. You literally couldn't get anything that was in the menu. They were constantly out of meat, out of tomatoes, out of potatoes (potatoes!). Instead they put a small hand-written piece of paper into the menu which listed everything the had at the time (a soup and a something else), that's if they were in good mood. If they were not, you just placed an order, and a waiter came back in 10 minutes to tell you it was beyond their ability to get some tomatoes and cucumbers to chop you some salad, try ordering something else, thank you.
In our system, that was never ever going to happen. Never. If the kitchen was short of something they needed to cook you a dish you ordered, the system would warn the waiter instantly, and she would tell you to order something else right this minute, that's instead of going to the kitchen and back.
It had to be a killer feature.
Read the rest of this entry »
NVL vs. COALESCE
From Stack Overflow:
Are there non obvious differences between
NVL
andCOALESCE
in Oracle?The obvious differences are that
COALESCE
will return the first non-NULL
item in its parameter list whereasNVL
only takes two parameters and returns the first if it is notNULL
, otherwise it returns the second.It seems that
NVL
may just be abase caseversion ofCOALESCE
.Am I missing something?
In case of two agruments of same type, these functions are basically the same: they both return first non-NULL
values.
However, they are implemented differently:
NVL
always evaluates both argumentsCOALESCE
doesn't evaluate anything after it finds the first non-NULL
argument
This behavior can be utilized.
Here's a real world example from a restaurant management system I developed about 10 years ago.
A bill that is presented to a customer consists of the items he or she ordered.
Majority of the dishes ordered are plain menu items: something that has its price printed in the menu.
However, there are some special things which are not included to the menu: a customer may want to order something exclusive and pay for it.
The manager should estimate the cost of ingredients, time and effort to make this dish, then calculate the price and name it: everything in 30 seconds.
These things are of course come overpriced, but the customers that make such orders don't seem to care much about money.
In the bill that appears as special order
. When a manager enters it into the computer, she should also enter the price she had calculated.
These prices then get back to the server and are stored in a table (not the one where regular menu prices are stored).
At the end of month, a report to the month's receipts should be generated.
Here are the tables containing the data relevant to the reports. They of course are simplified, but the principle remains the same.
Read the rest of this entry »
Ordering NULLs
Comments enabled. I *really* need your comment
From Stack Overflow:
In our product we have a generic search engine, and trying to optimize the search performance.
A lot of the tables used in the queries allow
NULL
values.Should we redesign our table to disallow
NULL
values for optimization or not?
NULL
values surely affect some optimizer decisions in Oracle.
It's widely known that Oracle won't include a column into index if all key values are NULL
.
This means that if we don't apply a filter which would imply non-NULL
condition on the column (like, any comparison operator in the WHERE
clause, or a comparison operation in the ON
condition of a JOIN
clause), the index is not guaranteed to return all values, and a FULL SCAN
will be required on the table.
It is less evident that ORDER BY
queries can be affected too.
Let's create a sample table and see how:
Read the rest of this entry »
Counting bans
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a MyISAM table in MySQL 5.0 like this:
id login ip banned I would like to find all users not banned (
banned = 0
) if at least 5 other users with the sameip
have already been banned (banned = 1
).
To do this, we should first find which IP
's have more than 5 bans and then find all non-banned users for these IP
's.
To find the IP
's we can use two methods:
- Find all banned
IP
's, group them and filter out those that haveCOUNT(*) < 5
- For each
IP
possible, find if there are 5 banned records for thisIP
using aLIMIT
clause
The first method doesn't require doing an extra JOIN
on the table, but it will do count all values just to figure out if the COUNT(*)
is more or less than 5.
The second method stops counting whenever it find the 5th value, but it requires an extra JOIN
.
Counting is bad if there are many values to count, and the extra join is bad if there are many JOIN
loops.
This means that the first method should be faster if there are lots of banned records, the, and the second method is faster for few banned records and few distinct IP
's.
Let's create two sample tables and see:
Read the rest of this entry »
Efficient EXISTS
Comments enabled. I *really* need your comment
In my previous article:
I compared efficiency of different methods to check for existence of a value in a subquery resultset.
These methods included IN
and EXISTS
predicates.
It was shown that in SQL Server, a poorly written EXISTS
can be very slow, while an IN
predicate is almost always efficiently optimized using one of the numerous SEMI JOIN
methods.
However, a properly written EXISTS
predicate can outperform IN
.
Let's create a sample table and see how:
Read the rest of this entry »
IN vs. JOIN vs. EXISTS
From Stack Overflow:
I'm wanting to select rows in a table where the primary key is in another table.
I'm not sure if I should use a
JOIN
or theIN
operator in SQL Server 2005:SELECT * FROM a WHERE a.c IN ( SELECT d FROM b )SELECT a.* FROM a JOIN b ON a.c = b.dIs there any significant performance difference between these two SQL queries with a large dataset (i. e. millions of rows)?
Before we begin I should note that the queries provided by the author are not similar in fact and could produce different resultsets.
If the table used in the subquery returns a value twice, a JOIN
will also return the matching rows twice, while an IN
condition will return them only once.
That's why it's more correct to compare to the following query:
SELECT a.* FROM a JOIN ( SELECT DISTINCT d FROM b ) bo ON a.c = bo.d
Now, à nos moutons.
The folk wisdom advises against using IN
constructs in favor of JOIN
's because they say that the subquery will be executed many times in a loop
.
Let's create some sample tables and see if it's really so:
Read the rest of this entry »
Group based identity
Comments enabled. I *really* need your comment
From Stack Overflow:
I'm working with an application that adds a new row to the database, based on the last row meeting a certain criteria.
Here is an over-simplified visualization:
A1 A2 A3 B1 B2 Using the visualization above, a web page loads up the highest B value, which is 2.
Then, after some time, it wants to insert B3, the next record in the series.
However, it has to check to make sure that someone else didn't do the same thing.
In actuality, each row corresponds to a place that a piece of equipment visited in a factory.
The page (not easy to rewrite), currently moves the equipment to the next step.
Is there a standard pattern for dealing with this type of problem, or do I simply need to lock the table?
This question combines two other frequently asked questions:
- How do I make group-based identity (i. e. keep several identity series identified by the value of a certain column)
- How do I ensure there will be no identity gaps?
While there are numerious solutions that involve some kind of selecting MAX(column)
and updating the row inside a transaction, there is a more simple way using linked lists.
We need to create the table as following:
Read the rest of this entry »