Archive for 2009
SQL Server: aggregate bitwise OR
From Stack Overflow:
I am creating a script for merging and deleting duplicate rows from a table.
The table contains address information, and uses an integer field for storing information about the email as bit flags (column name
value
). For example, if bit 1 is set invalue
, that means the record is a primary address.There are instances of the same email being entered twice, but sometimes with different
value
s. To resolve this, I need to take thevalue
from all duplicates, assign them to one surviving record and delete the rest.My biggest headache so far as been with the merging of the records. What I want to do is bitwise
OR
allvalue
s of duplicate records together.
From database theory's point of view, this design of course violates the 1NF, since multiple properties are contained in one column (in bit-packed form). It would be easier to split them apart and create a separate column for each bit.
However, it can be a legitimate design if the fields are not parsed on the database side, but instead passed as-is to a client which needs them in this bit-packed form. And anyway, helping is better than criticizing.
We have three problems here:
- Select a first record for each set of duplicates
- Update this record with bitwise
OR
of all values in its set - Delete all other records
Step 1 is easy to do using ROW_NUMBER()
.
Step 3 is also not very hard. Microsoft has a knowledge base article KB139444 that described a really weird way to remove the duplicates, but it may be done much more easily using same ROW_NUMBER() with a CTE or an inline view.
See this article I wrote some time ago on how to do this:
Now, the main problem is step 2.
SQL Server lacks a native way to calculate bitwise aggregates, but with a little effort it can be emulated.
The main idea here is that for bit values, aggregate OR
and AND
can be replaced with MAX
and MIN
, accordingly.
All we need is to split each value into the bits, aggregate each bit and merge the results together.
Let's create a sample table:
Read the rest of this entry »
Double-thinking in SQL
One of the first things a novice SQL developer learns about is called thinking in SQL
, which is usually being opposed to procedural thinking
Let's see what part of brain does this intellectual activity take and how to use it.
Two features distinguish SQL from other languages you learned as a 11-year old kid on your first PC, like BASIC or perl or maybe even C++ if you're such a Wunderkind.
First, SQL is set-based. It does things with sets.
Every tool is designed to do things with something else. Like, you use a hammer to do things with nails, or use a screwdriver to do things with screws, or use an oven to do things with food.
Same with computer languages.
BASIC does things with variables. perl does things with scalars, arrays, hashes and file streams. Assembly does things with registers and memory.
You should not be confused by something like registers are just a special case of variables
, or a hash is just a generalized container which exposes this and this method
or something like that. No.
A hash is a hash, a variable is a variable and a register is a register.
Like, an egg is a food and rice is a food and it's possible to cook some eggs in a rice cooker and vice versa, but they are just wrong tools to do that.

Akihabara district in Tokyo City. You can buy any type of gadget you can imagine, which is about 2% of what you will find there.
Prehistoric men had to make do with hammerstones and lithic blades (even to court their women), but now we have a whole district in Tokyo City for gadgets with USB type A, and another district for gadgets with USB type B.
So if you feel the urge to hash something and then make a good old array out of this, you don't use assembly, but perl or PHP instead.
Same with SQL. SQL does things with sets.
It's a tool that allows you to take a dozen or two of sets, mix them together, knead and wedge them then chop them apart and mix again, but the output you get is still a set and all inputs are sets.
Eeverything you do in SQL, you do it on sets. That's why SQL is called a set-oriented language.
Ok, that was the first feature that distinguishes SQL from other languages. What's the second one?
SQL is a declarative language. This means that you express what
you want to do with sets, not how
you want to do it.
This requires a little explanation.
Read the rest of this entry »
Selecting compatible articles
Comments enabled. I *really* need your comment
From Stack Overflow:
I need to formulate an SQL query that returns all articles that are compatible to a set of other articles (of arbitrary size).
So for a list of article numbers
A, B,… , N
the question is:Give me all articles that are compatible with
A and B and … and N
For example, consider the table
A B 1 2 3 1 3 4 If I wanted all articles that are compatible with 1, the query would return (2, 3).
The query generated by the list (2, 3) will return 1, whilst the query generated from the list (1, 3) generates an empty list.
This table describes a friendship: a symmetric irreflexive binary relation.
That is:
- For any given
a
,b
, ifa
is a friend tob
, thenb
is a friend toa
- For any given
a
,a
is never a friend to itself
This relation is heavily used by social networks.
A normalized table describing this relation should be defined like this:
CREATE TABLE t_set (
a INT NOT NULL,
b INT NOT NULL
)
ALTER TABLE t_set ADD CONSTRAINT pk_set_ab PRIMARY KEY (a, b)
ALTER TABLE t_set ADD CONSTRAINT ck_set_ab CHECK (a < b)
[/sourcecode]
, the check being added to account for the relation symmetry.
Complete relation can be retrieved with the following query:
Read the rest of this entry »
Selecting birthdays
Comments enabled. I *really* need your comment
Answering questions asked on the site.
James asks:
I'm developing a forum and want to select all users that have a birthday within the next 3 days.
How do I do it?
This is in SQL Server
This is a very nice feature which every decent forum should have, and I'll be glad to answer this question.
Unfortunately you didn't provide the names of your tables, so I'll have to make them up.
It's not much of a stretch to assume that your table is called t_user
and you keep the user's birthdates in a DATETIME
field called birthdate
.
A birthday within next 3 days means that if you add the person's age to the person's birthdate, you get a date between the current date and three days after it.
To check this, we just need to calculate the number of months between the dates and make sure that it's divisible by 12 with reminder of 0 or 11 (to handle month transitions).
Then we need to add a transition month and divide the number of months by 12. The quotient will give us the number of years we need to add to the birthdate to compare the result with GETDATE()
.
Let's create a sample table and see how to do it:
Read the rest of this entry »
Finding duplicates on either of the fields
Comments enabled. I *really* need your comment
From Stack Overflow:
This query for creating a list of сandidate duplicates is easy enough:
SELECT COUNT(*), Can_FName, Can_HPhone, Can_EMail FROM can GROUP BY Can_FName, Can_HPhone, Can_EMail HAVING COUNT(*) > 1But if the actual rule I want to check against is
FName
and(HPhone OR Email)
— how can I adjust theGROUP BY
to work with this?
This is quite a common need.
The problem with conditions like that is that GROUP BY
will not work on them since they are not transitive.
Let's imagine a sample dataset:
id | name | phone | |
---|---|---|---|
1 | John | 555-00-00 | john@example.com |
2 | John | 555-00-01 | john@example.com |
3 | John | 555-00-01 | john-other@example.com |
4 | James | 555-00-00 | james@example.com |
5 | James | 555-00-01 | james-other@example.com |
We see that records 1 and 2 are duplicates
(I'd better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.
But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn't be considered similar if not for the record 2 that binds them.
Given this, we should use similarity chains instead of GROUP BY
.
A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).
Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.
This set can be identified by the id
of the least node.
We can solve this tasks using recursive CTE's. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.
Let's create a sample table:
Read the rest of this entry »
Selecting last forum posts
Comments enabled. I *really* need your comment
From Stack Overflow:
I built a custom forum for my site using MySQL.
The listing page is essentially a table with the following columns: Topic, Last Updated, and Replies.
The DB table has the following columns:
id name body date topic_id A topic has the
topic_id
of 0, and replies have thetopic_id
of their parent topic.How do I query 20 last updated topics efficiently?
We have two kinds of topics here: those that had been answered and those that had not.
The queries for these two kinds of topics should be optimized using different methods and it's better to query for them using two different queries merged later using UNION ALL
:
- To find last 20 replies, we will employ the technique described in this article:
This technique uses a composite index on
(date, id)
to find 20 latest replies to distincttopic_id
's:- First, it scans the index on
(date, id)
to retrieve records with a non-zerotopic_id
in descending date order. - For each record found on step 1, it scans the index on
(date, id)
again in descending order and finds the first record with the sametopic_id
. - If the
id
's of records found on steps 1 and 2 match, this means that the record found on step 1 is the latest for thistopic_id
and we should return it.
- First, it scans the index on
- To find last 20 unreplied topics, we just use a good old
NOT EXISTS
.In this case a composite index on
(topic_id, date)
will come handy: we will need to select the last20
unreplied topics, and this index will be useful to implement both filtering ontopic_id
and ordering bydate DESC
. - Finally, we will merge these resultsets using
UNION ALL
and limit the result.
Now, let's create a sample table:
Read the rest of this entry »
Overlapping numbers
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Kat asks:
I want to determine if I have any overlapping ranges of numbers in a table for specific
id
's.For example, the table has the following:
ID START_NUMBER END_NUMBER 1 3035112000 3035112010 2 3035112011 3035112020 3 3035112003 3035112005 4 3032925000 3032925010 1 3035113000 3035113050 2 3035113000 3035113050 3 3035113051 3035113052 4 3032925500 3032925550 I only want to pull those that have overlapping number ranges for
id
's 1, 2, and 3 (but not 4).This is in Oracle.
We need to write a query to pull out all id
's of rows that overlap at least one other row in the table.
Let's create a pair of sample tables:
Read the rest of this entry »
Schema junk
Many tasks in SQL world can be efficiently performed using stored procedures.
All database management systems of the Big Four
(that is Oracle, SQL Server, MySQL and PostgreSQL) have a concept of a stored procedure: a little server-managed program written in a procedural language that allows running SQL statements and returning results.
They are widely recommended for use, since they allow easy separation of business rules from data storage rules, fine-grained security, integrity control and all other things software analysts mumble for their living.
People widely accept this theory and if you ask anyone should I use the stored procedures?
, the answer you will get will be a unanimous yes.
The problem is that people rarely use them in fact in real applications. Everybody knows they should but nobody really does.
More than that: despite the fact that stored procedures in fact can be more efficient than plain queries (even for one-time tasks), people often prefer the latter. Writing a procedure to obtain a resultset seems to be an overkill
from the point of view of most developers, they rather prefer doing it with a plain SQL query.
Why so?
Let me tell you another boring story. Don't worry, it won't take long.
Read the rest of this entry »
Searching for NULL columns
Comments enabled. I *really* need your comment
From Stack Overflow:
I want to make sure I'm not inserting a duplicate row into my table (i. e. only
PRIMARY KEY
different).All my fields allow
NULL
's as I've decidedNULL
to meanall values.Because of
NULL
's, the following statement in my stored procedure can't work:IF EXISTS ( SELECT * FROM MY_TABLE WHERE MY_FIELD1 = @IN_MY_FIELD1 AND MY_FIELD2 = @IN_MY_FIELD2 AND MY_FIELD3 = @IN_MY_FIELD3 AND MY_FIELD4 = @IN_MY_FIELD4 AND MY_FIELD5 = @IN_MY_FIELD5 AND MY_FIELD6 = @IN_MY_FIELD6 ) BEGIN goto on_duplicate END, since
NULL = NULL
is not true.How can I check for the duplicates without having an
IF ISNULL
statement for every column?
First of all, can't help being a smartass and tell that a UNIQUE
index should be defined over these columns.
SQL Server does index NULL
values and they are first class values as long as uniqueness is concerned, so creating this index will manage the duplicates automatically.
Now, back to business. Of course there are cases when creating such an index is not acceptable.
In these cases a very nice but little known operator INTERSECT
comes handy.
This operator takes two resultsets with the same layout and finds rows common to them (eliminatnig duplicates), as SQL Server's understands common: each value in each column should be either equal to one in the other row or they both should be NULL
's.
Let's create a sample table:
Read the rest of this entry »
Overlapping ranges: PostgreSQL
Comments enabled. I *really* need your comment
Continuing the theme on overlapping ranges:
From Stack Overflow:
I have an event table that specifies a date range with
start_date
andend_date
fields.I have another date range, specified in code, that defines the current week as
week_start
andweek_end
.I'd like to query all events for the week.
Finally, PostgreSQL.
A quick reminder: we have two options to write a condition for such a query.
A more simple one:
start_date < @week_end AND end_date > @week_start
is, uh, more simple, but less efficient in SQL Server and MySQL.
A more complex one:
(start_date > @week_start AND start_date < @week_end)
OR
(@week_start BETWEEN start_date AND end_date)
is more index friendly and is more performant in SQL Server, Oracle and MySQL (with a slight schema change to enable SPATIAL index support)
Now, let's create a sample table in PostgreSQL:
Read the rest of this entry »