One of the first things a novice SQL developer learns about is called
thinking in SQL, which is usually being opposed to
Let’s see what part of brain does this intellectual activity take and how to use it.
Two features distinguish SQL from other languages you learned as a 11-year old kid on your first PC, like BASIC or perl or maybe even C++ if you’re such a Wunderkind.
First, SQL is set-based. It does things with sets.
Every tool is designed to do things with something else. Like, you use a hammer to do things with nails, or use a screwdriver to do things with screws, or use an oven to do things with food.
Same with computer languages.
BASIC does things with variables. perl does things with scalars, arrays, hashes and file streams. Assembly does things with registers and memory.
You should not be confused by something like
registers are just a special case of variables, or
a hash is just a generalized container which exposes this and this method or something like that. No.
A hash is a hash, a variable is a variable and a register is a register.
Like, an egg is a food and rice is a food and it’s possible to cook some eggs in a rice cooker and vice versa, but they are just wrong tools to do that.
Prehistoric men had to make do with hammerstones and lithic blades (even to court their women), but now we have a whole district in Tokyo City for gadgets with USB type A, and another district for gadgets with USB type B.
So if you feel the urge to hash something and then make a good old array out of this, you don’t use assembly, but perl or PHP instead.
Same with SQL. SQL does things with sets.
It’s a tool that allows you to take a dozen or two of sets, mix them together, knead and wedge them then chop them apart and mix again, but the output you get is still a set and all inputs are sets.
Eeverything you do in SQL, you do it on sets. That’s why SQL is called a
Ok, that was the first feature that distinguishes SQL from other languages. What’s the second one?
SQL is a declarative language. This means that you express
what you want to do with sets, not
how you want to do it.
This requires a little explanation.
In fact, every language is declarative, even assembly.
When you issue a command like
xor eax, eax, you just command your CPU:
go warm up your circuits and make it so that I see a zero in EAX right this nanosecond!.
This command needs to be translated into an opcode, and the opcode should go to the instruction pipeline and dispatched into some circuits and make some transistor to change their state, but it is none of your concern. The language and the CPU abstract all this from you.
In the same way, perl abstracts hashes and arrays, ASP abstracts HTTP requests and responses and Windows API abstracts windows and controls.
SQL, of course, abstracts sets.
Abstractions simplify your life. But an abstraction in computer world usually means one of two things:
- You save some time on brainless operations, like typing or remembering, or
- You save some time on decision making
Brainless part is easy: you just do things human way and make computer to do same things computer way. This is usually a one-to-one mapping.
xor eax, eax is much easier to remember than
31 C0. That’s why we use assembly.
int __stdcall is much easier to remember than
pass variables by value in stack, rightmost first, return the result in EAX and clean the stack inside the calling function. That’s why we have calling conventions.
Darling Ctrl-K is much easier to type than something like
my spouses really long email address made up by a sysadmin whose head is full of unwarranted self importance at example dot com. That’s why we use address book in Outlook.
These examples don’t imply any decision making. There is hardly a way to mess something up here, that’s why computers are very good in these things.
Decision part is more complex, and it usually deals with more complex abstractions.
In an x86 CPU there are at least two ways to zero a EAX:
xor eax, eax, and
mov eax, 0
The first instruction is more efficient and takes only 2 bytes in code, while the second one is (or at least was) less efficient and takes something about 5 bytes.
The problem is that the first instruction works only on zeros.
Like, for the following C code:
if (myfunc()) return 1; else return 0;
, an efficient
C compiler should generate different opcodes for different execution paths, despite the
return construct being the same. This is where compiler makes a decision:
if I need to return a zero in EAX, I’ll better use .
xor eax, eax, otherwise I just have no choise but to use
Note that the code will still work if the compiler makes incorrect decision, it just will be 3 bytes longer and maybe several nanoseconds less efficient.
That’s what distinguishes
time saving features from
decision making features.
In the first case, you know in advance how the computer will make something, and if it’s doing it wrong way, it’s doing it wrong.
In the second case, you know what will it make, but don’t know (or don’t care) how as long as it’s done.
SQL heavily, vastly relies on decisions made by the computer. Probably more than any other widely used language.
There is only one way to translate a
xor eax, eax into an opcode. Two ways to compile
return 0 in an
stdcall function. Probably a dozen ways to decide which dead object should be recycled by a garbage collector.
But for a simplest
SELECT query which is only 30 rows long, there are hundreds of decisions to make, and I’m not making this up:
- There are
INDEX FAST FULL SCAN,
INDEX SKIP SCAN,
INDEX RANGE SCANto retrieve the values from the tables
- There are
NESTED LOOPS JOIN,
HASH JOINto join them
- There are
NESTED LOOPS SEMI JOIN,
MERGE SEMI JOINand
HASH SEMI JOINto check
- There are
NESTED LOOPS ANTI JOIN,
MERGE ANTI JOINand
HASH ANTI JOINto check
- There are
HASH AGGREGATEto group the records
- There are
EAGER SPOOLto build temporary indexes for the duration of the query
- There are numerous ways to pipeline and parallelize the resultsets to benefit from multiprocessing
- There are bitmap indexes to indicate existence of a record in a memory-efficient way
- There is
CONNECT BY PUMPto keep the recursion stack for recursive queries
But why so complex?
As I already said, SQL works on sets.
Sets are very complex things which can take lots of memory and disk resources, and operations on them may be very costly.
And worse than that: there are many operations for which there are no single
algorithm of choice.
For instance, there are 3 widely used
NESTED LOOPS JOIN,
MERGE JOIN and
The first one is best for joining a resultset on a selective indexed field, the second one is best for joining sorted resultsets, and the third one works equally poor for everything, but still exposes the best time when the resultsets are neither filtered nor sorted, but this only applies for joins on equality condition, otherwise it’s impossible to use it, et cetera, et cetera.
The whole point of SQL is that it abstracts all this complexity from you. You just write your
SELECT * FROM table1 NATURAL JOIN table2 and let the optimizer choose all these, Lord help me, hashes or merges or indexes or whatever.
The first thing any SQL developer should learn is formulating his needs in the terms of sets.
I won’t give a lecture of relational algebra here, but there are some basic operations you can do with sets, like joining them, grouping, sorting etc., and by combining these operations, you can do almost everything with sets.
Like with plain algebra: there is addition, and subtraction (which is just a generalized addition), and multiplication (which is a generalized addition too), and so on, and finally we get a proof Fermat’s Last Theorem (which is probably is a very, very generalized addition too, but I’m not smart enough to grok it).
thinking in SQL means an ability to formulate your needs in terms of set operations and to tell your database system how exactly you’d like to mutilate your data before it’s output to you.
Here’s the magic of SQL, kids. You just tell it what you want, and it decides the best way to do it, that’s how smart computers are. Thank you for your time, goodbye.
Why, you’re still here?
Oh, I finished on
thinking, and my article was called
Remember I said that SQL decides the best way?
Well, no, it doesn’t. And every SQL developer sees it very, very soon.
Columnist Joel Spolsky coined the term
Leaky Abstraction for the situations when the abstracting tool gives you a result in a non-optimal way, or gives no result at all. And SQL is number 2 in his list of examples, right after iterating a two-dimensional array which can cause lots of page faults if done improperly.
The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to define merely what you want and let the database figure out the procedural steps to query it.
But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries.
A famous example of this is that some SQL servers are dramatically faster if you specify
where a=b and b=c and a=cthan if you only specify
where a=b and b=ceven though the result set is the same.
You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.
The most interesting thing he mentioned is the query plan analyzer.
Let’s return to other languages. Like PHP. PHP abstracts lots of things, but among the most used ones are arrays , HTTP requests and web servers.
In PHP, you just write something like
$a = $_SERVER['HTTP_USER_AGENT'] and the string that browser identified itself with in the HTTP request goes to the array.
Ask almost any decent PHP developer on what’s going on inside the computer when this command executes, and he or she will most probably shrug and give an answer like
It must allocate some memory, probably, then get the string out of Apache, and then, um, put the string into the memory and increase the counter somewhere.
In other words, they don’t know and don’t care. But they still are brilliant PHP developers, they can do wonders with arrays and requests.
gcc has an option to output the assembly when you compile a program so you can check how effciently it did its work. But ask any C++ programmer what switch do you use to do that, and 9 of 10 would not be able to remember it. It’s just not used frequently.
Knowing the PHP or gcc internals is not required to use these tools efficiently. PHP just works and you can program good PHP all your life without ever knowing how exactly does the PHP interpreter push a function reference to an array of hashes.
And of course no PHP distribution comes with a debugger that shows you what is it going do on each step:
look, I’m going to allocate this memory and fill it with zeroes and scan it and put a string here.
However, every decent RDBMS has some mean to preview an execution plan for your query. How exactly will the query work? What algorithm does the query optimizer use? Will it scan table A or table B first?
I know no other tool or language that exposes so much of its internals.
To know what PHP or perl does when allocating an array, you’ll have to run it under debugger. To know what .NET does when creating as instance of List is probably a licence violation, a felony and a deadly sin, all at once. Many products just refuse to run under debugger so that their precious internal logic of
twice two making a wax candle would not leak outside.
But when it comes to SQL, Microsoft gives a graphical tool which allows you to see what exactly happens when you run an SQL query. And unlike gcc -s, this tool is used heavily.
They actually made an effort to show you what’s going on inside their product. To show, not to hide.
They are exposing exact algorithms used by the server, they are explaining how much time will an operation take, and they are not hiding it in a dusty PDF file: they make it so that you can actually understand it, with icons and arrows and nice charts.
This is of course not because they are so generous and open-hearted and want to contibute their knowledge to a free world. This is because if not for this tool it would be impossible to use their product at all. Stay assured that if they invented an algorithm that would allow generating optimal queries, they would do anything possible and impossible to keep it secret.
SQL abstraction leaks like a sieve filled with superfluid helium. And this is not because RDBMS developers are dumb. This is because the thing they are trying to abstract is way too complex to do it efficiently. The possible operations on sets and the intermediate results of these operations are so numerious and multifarious that with current level of technology it’s just not possible to write an efficient optimizer, an
optimizer as in optimal.
In SQL world, a
correct query does not mean an
optimal query, which is true for almost tools. But is doesn’t even mean a
usable query, since the query plan can be so unoptimal that you will wait forever for a query to complete.
Any decent SQL developer should know everything about execution plans and see in advance how efficient or not efficient they are.
Here is a classical example. Don’t worry, it’s quite simple.
Say, we need to select all rows from a table along with a running sum:
sum here accumulates
value added on each step.
In relational algebra, it’s super easy to write a query that would make this resultset:
SELECT t1.id, t1.value, SUM(t2.value) FROM mytable t1 JOIN mytable t2 ON t2.id <= t1.id GROUP BY t1.id, t1.value
This reads as
id, return the sum of
value in rows with
id less or equal to given.
Quite simple, right?
No SQL engine I’m aware of can run this exact query optimally.
An optimal solution would be something like
take a row from the table, record the current value of .
sum, take next
value, add it to
sum, pipeline the result
But the optimizers are not programmed to see this algorithm in the query worded as above. Best they can do is something like
take a row from the table, then take all other rows with smaller . It the best algorithm they can come up with, and I’m not making this up. There are worse algorithms, like
ids, calculate their sum, pipeline the row, then forget the previous result and do it all again, on a larger set of rows
select all matching rows, hash them together, calculate the sums and use disk space to keep intermediate results for they don’t fit into memory
Of course modern systems can use more efficient queries, like this:
SELECT id, value, SUM(t1.value) OVER (ORDER BY id) FROM mytable
, but this is different query. An optimizer cannot see that this query is identical to the first one in terms of results. It’s the task of an SQL developer to see it and reformulate the query.
Optimizers can efficiently do some simple tasks like picking a correct index, choosing correct access method depending on the filter selectivity and so on. However, it’s not enough to allow formulating your queries in SQL so that they are just correct. You still need to formulate your queries so that they are usable, i. e. an optimizer is able to build an efficient plan for them.
That of course means you should know how optimizers work.
In eariler systems, like Oracle before version 8, optimizer had a set of rules. Like, if we see a filtering condition on an indexed field, we use that index, etc. For simple queries, that yielded decent plans, for more complex queries that was a pain.
More advanced optimizers try to predict how much time will one or another plan take. Usually they gather statistics to do that. That is, from time to time they see how the values are distributed along your tables, keep some aggregates, and try to predict the query time.
They sometimes mistake, but this is not the real problem. The real problem is that even the brightest of optimizers cannot reword even the simplest queries. They cannot even change a self-join to a window function in the query above. This still should be done by an SQL developer.
Thinking in SQL means
formulating you needs in terms of set operations. This is what usually said. And this is truth, but not the whole truth. This is not enough to be a good SQL developer.
To be a superb SQL developer, you need to doublethink in SQL. This Orwellian term describes the process the best:
Doublethink means the power of holding two contradictory beliefs in one’s mind simultaneously, and accepting both of them.
To be a good SQL developer, you should be able to imagine the query in terms of sets and in terms of algorithms at the same time.
It’s not enough to come up with a set operation that yields correct results. It will be inefficient.
It’s not enough to come up with a good algorithm that transforms your set as you need and try to force-feed it to the server using procedural approach. The servers are not good at that.
What you need to do is to formulate your query so that the optimizer chooses exactly the algorithm you need. And to do this, you should know both the algorithm and the set operations. You need to doublethink.
This is not always an easy task to do. But it definitely deserves learning.
And only if you learn it you will be able to harness all the power of SQL.