My blog is about SQL, and SQL is about relational databases.
So, what's SQL and what's a relational database and what's an entity-relationship model?
These terms are often misused and this leads to confusion.
In this article, I'd like to explain what a relational database is.
For starters, a little quote from Jack London's
Howkan began to read. Imber listened for a space, when a wonderment rose up in his face and he broke in abruptly.
That be my talk, Howkan. Yet from thy lips it comes when thy ears have not heard.
Howkan smirked with self-appreciation. His hair was parted in the middle.Nay, from the paper it comes, O Imber. Never have my ears heard. From the paper it comes, through my eyes, into my head, and out of my mouth to thee. Thus it comes.
Thus it comes? It be there in the paper?Imber's voice sank in whisperful awe as he crackled the sheets 'twixt thumb and finger and stared at the charactery scrawled thereon.It be a great medicine, Howkan, and thou art a worker of wonders.
It be nothing, it be nothing,the young man responded carelessly and pridefully. He read at hazard from the document:In that year, before the break of the ice, came an old man, and a boy who was lame of one foot. These also did I kill, and the old man made much noise—
It be true,Imber interrupted breathlessly.He made much noise and would not die for a long time. But how dost thou know, Howkan? The chief man of the white men told thee, mayhap? No one beheld me, and him alone have I told.
Howkan shook his head with impatience.Have I not told thee it be there in the paper, O fool?
Imber stared hard at the ink-scrawled surface.As the hunter looks upon the snow and says, Here but yesterday there passed a rabbit; and here by the willow scrub it stood and listened, and heard, and was afraid; and here it turned upon its trail; and here it went with great swiftness, leaping wide; and here, with greater swiftness and wider leapings, came a lynx; and here, where the claws cut deep into the snow, the lynx made a very great leap; and here it struck, with the rabbit under and rolling belly up; and here leads off the trail of the lynx alone, and there is no more rabbit,—as the hunter looks upon the markings of the snow and says thus and so and here, dost thou, too, look upon the paper and say thus and so and here be the things old Imber hath done?
Old Imber made a great analogy. Information is but markings on the snow. When you look at the snow, there are no rabbit or lynx there. But when an experienced hunter sees their traces he can go and find them.
However, the old Indian could only read the information while his nephew could also create it by writing the words on paper and reading them back. He had full control over the information he created.
And for information to be useful, it should be structured.
When a car dealer describes a car he really wants to sell he can spend half an hour talking of magnificent design, good extras and a decent price, but just
forget to mention that, say, the warranty terms are not those you'd expect.
But when you fill a credit card application form, you answer a set of predefined questions. What is your name? Where do you live? How much do you earn? Do you have incurable cancer?
The bank needs all this information to make sure you will pay your debt (or at least they can get a decent price for your debt when it's sold to a collection agency).
And they want you to hide no important information about yourself and to represent all information in such a way that it can be easily interpreted and make sure nothing they are interested in would be
forgotten to answer or left behind.
They want the data to be structured. And by making you to fill a form they enforce the structure of the information you provide.
They force you to use their data model.
Information in a database should be of course structured just as well. You cannot fill a file with something like
John is a nice person. He is tall and has blue eyes and enjoys playing tennis and singing and hope it to be easily searchable or processable by a computer. Instead, you fill the database with the information the database designer considered to be important using a well defined format and if you didn't fill something that you were required to fill, you just have to start over.
Paper forms enforce the structure by making you fill the square boxes with letters and check marks. Database management systems are more flexible but they offer some ways to enforce data structure just as well.
What can you put into a file? Almost everything. You can even put the whole
War and Peace. The problem is that you cannot use a computer to tell the name of Pierre Bezukhov's wife or how old Prince Bolkonski was when the war started. You will need to read the whole file yourself to figure it out.
If you want a computer to do it, you should define the relations between the different pieces of information that are stored in the database. Basically, you should give an algorithm on how to determine where to look.
The way the relations between the different pieces of information are defined is called a database model.
If you read a legal document (like CC-BY license), you will see something like this:
You may Distribute or Publicly Perform the Work only under the terms of this License. You must include a copy of, or the Uniform Resource Identifier (URI) for, this License with every copy of the Work You Distribute or Publicly Perform.
The capital letters in the beginning of the otherwise common words make you a hint that these are some legal definitions in scope of this document.
Legal documents are modeled so you need to scroll the document from the beginning, find a section called
Definitions and read what is to
Distirbute and what is
License and who on Earth are
You, and you better pay attention to it.
Each time you see a capitalized word in a legal document, you know it's a reference to some other place in the document.
Databases rely extensively on the concept of references.
This is made for two reasons: first, you can save space by replacing the whole piece of information with a relatively short reference to it, second, if you need to update something, you update it in one place and make the computer to do all substitutions for you.
Historically, most popular database models are hierarchical, network and relational.
Let's see what that all means.
Hierarchical model, as the name suggests, stores hierarchies. Each piece of information stored in the database has some kind of a parent. A database, basically, offers two methods: get all children of a given record and get a parent of a given record. Everything else is up to you. You can define your own data structure and access them using these two methods.
For instance, you want to keep the clients, orders and items. This perfectly fits into this model: you store information about your clients in the nodes that are children to the root node (which each hierarchical database has), information about the orders is stored in the nodes that are children to the client nodes, and information about the items ordered is stored in the nodes that are children of the order nodes.
Now you want to get the total count of items a certain client has ordered. No problem.
You begin with the root node (in a hierarchical database, you always begin from the root node), and iterate though its children (which are clients, by convention).
Now you find the node that represents the client you need (the database is not interested in how do you do it), and issue a command to iterate though the children of this node.
For each child of that node (which is an order, by convention), you iterate through all of its children, each of them, in its turn, being an item. And when if find an item, you just increment a counter.
Usually, a database can batch all these commands and even make some simple decisions on how to implement these tasks in more or less efficient way.
The point is that if the database is hierarchical, you formulate your needs in terms of these parent-child relationships.
For each child of the root node that happens to be the given client, find all chidren, and for each of them, find its children, and for each of them, increase the counter
Doesn't it remind you of something?
One of the most popular data storage formats is hierarchical. Yes, I'm talking of XML.
It basically requires that you have tags properly open and closed within each other. If a pair of tag is enclosed into another pair of tags, the former is the child while the letter is the parent.
If we have data in this format:
<main> <client name="John"> <order id="1"> <item name="beef" quantity="1"/> <item name="butter" quantity="3"/> </order> <order id="2"> <item name="pork" quantity="2"/> <item name="bread" quantity="4"/> </order> </client> <client name="Jim"> <order id="3"> <item name="pork" quantity="1"/> </order> </client> <client name="Mary"> <order id="4"> <item name="oil" quantity="5"/> </order> </client> </main>
, how do we count the number of items ordered by John?
We issue the following XPath query:
This is basically what was described:
beginning from the root node,
get all child nodes that happen to be clients,
and whose name is John,
and for each of these nodes, take the child nodes that happen to be orders,
and for each of these nodes, take the child nodes that happen to be items,
and count them
Now, how do we find the clients that ordered both beef and pork?
We also should traverse hierarchies here. Something like this: beginning from the root node, find all children (clients) that have a child (order) that has a child (item) that are beef; and also have a child (order) that has a child (item) that is a pork.
In XPath, it looks like this:
As you can see, in a hierarchical model everything is about walking the hierarchy trees hither and thither. All data access paths and queries revolve around hierarchies.
Network model is an extension to the hierarchical model. Each node in the network model can have multiple parents and children so that you can walk in many directions from any node.
The operations are more complex: get all children in a given descendency line, get a parent in a given ancestry line.
This, of course, can make the queries more easy and elegant (and their implementation more efficient) than using a hardcoded one-way hierarchy.
For instance, in a network model you can keep the order items as children right to the clients (while retaining the client-order hierarchy), and at the same time keep them as children to goods ordered, so that each order item is a child on three lines to three different nodes: a client, an order and a good.
In this design, finding a client who ordered a certain good is more easy: for each child of the good, find all its parents on the client line.
However, we should maintain these relationship ourselves. What is our data changes? We should not only update the data, but the references too. And if the relationship structure grows complex, this can be inefficient and prone to errors.
Of course we can not to define these relationships but use the same path access methods as we did in XML to infer them at runtime. But we could then just use hierarchical model as well. The whole point of network model is that it allows you to hardwire some relationships to speed up data access if you are ready to support them.
In the models we considered before, everything revolved around walking from node to node. You have to define references, walk them, find the nodes and do things with the content of the nodes.
Everything's fine except that sometimes it's inefficient and adding new reference paths is always a pain.
Besides, the reference paths are predefined. The database is not aware of their nature and cannot make shortcuts.
Is there a way to make the reference management easier and data access faster?
In 1969, Dr. E. F. Codd found such a way and proposed what we now call relational model.
We just talked about hierarchies, references, paths. All these are relations.
An order made by a client in the example above is related to that client. An item that is a part or the order is related to that order. A good that the item consists of is certainly related to the item.
Is a good once ordered by a client related to that client? Well, it is, if we think about it. It certainly is. We can define this relation in English:
if a good was ever ordered by a client, it is related to that client, otherwise not.
How do we set this relation in a hierarchical model? We don't. There is only one kind of a relation, namely a parent-child relation. An item can be a child either to a good or to an order, not both. We can only check for the relation in runtime, traversing all clients, all items and all goods. Very inefficient.
How do we do it in a network model? We make a client-good ancestry line and update reference from a client to a good each time an item of that good is ordered. We need to track all items and delete the client-good references as soon as the last item of that good was removed from the client's orders. Too hard to maintain.
In both these examples, the relations are either hardwired and require an effort to maintain them, or should be resolved in runtime using hardcoded and inefficient access paths.
Codd's idea was to rethink the whole concept of a relation.
In the examples above, the order is a piece of information stored somewhere in the database. It has some pointers that point to the other related nodes, or its position defines a relation, or something else. Is any case, the relation is defined when designing the database.
However, a relation (say, a binary relation) can be defined by mere enumeration of all possible pairs of related entities. We just take a sheet of paper and write down every single pair of all pairs possible. If the pair is written down, the relation exists. If it's not, there is no relation.
And what is most important, we don't relate the information, or the nodes, or the file parts. We relate the names, monikers, identifiers, something that has a type that can be easily processed by a computer.
Consider an XML:
<client name="John"> <order id="1"> <item name="beef"/> <item name="oil"/> </order> <order id="2"/> </client> <client name="Mary"> <order id="3"> <item name="oil"/> <item name="apple"/> </order> </client>
The relation is defined by the position of the
<order> tags in the file. If we cut and paste them out of the client tag, we lose the relation. We will have two order nodes unrelated to any client.
But we rather can enumerate these relations in a list:
In the left column this list contains the integer numbers. In the right column it contains strings up to 20 characters in length.
And it completely defines a relation. It is the relation between integers and strings up to 20 characters.
We see the pair 1, John? Yes.
Is 1 related to John? Sure, we see the record for that. 1 from the left, John from the right, everything on its place. 1 is certainly related to John.
Is 2 related to John? Sure it is, for the same reason.
Is 3 related to John? No, there is no record. Why? I don't know. Is it because the order 3 is related to client Mary? Dear user, I don't know what's an order, what's a client, who's Mary or is 3 too much or too few. I'm a relation defining table. I know nothing of the nature of these integers and strings. You see the record, there is a relation. You don't see the record, there is no relation. Want a relation, add a record. Don't want a relation, remove a record.
Is Mary related to 3? Stupid question. Integers to the left, strings to the right. Mary is not an integer, it's out of the relation's domain. The question has no sense from this relation's point of view.
In the same way we can define a relation between orders and items. Just write down all pairs of items and orders that are related:
We don't see order 2 here. Why? Because no item is related to that order. This order does not participate in the relation, that's why we don't see it in the table that defines the relation.
This may be because the order with such an id does not exist, or may not exist at all because the id is invalid, or may not participate in the relation because of some constraints. A relational table does not care.
By mere enumerating the pairs of related elements, we define two relations: that between items and orders and that between orders and clients.
A relational table, however, knows nothing about items, orders and clients. All they contain are the sets or numbers and strings.
In order for these relations to make sense, we should make one obvious but important assumption: an item is completely defined by the name of the product, an order is defined by the order id and a client is defined by his or her first name.
By defining the relations, we imply that if two orders have the same id, these two orders are in fact the same order. If you see two clients with the same name, this is one client. What if two different clients share one name? Too bad then, you've chosen a wrong way to define a client. Add the last name or use an SSN to identify a client, or assign them your very own numbers or whatever. What if John ordered olive oil and Mary ordered grape seed oil? You cannot tell if you chose the word
oil to define both these products. This is not a relational database's concern.
Let's consider an XML model:
<client name="John"> <order id="1"/> </client> <client name="John"> <order id="2"/> </client>
We have two different clients here, both named Johns. We can tell them apart because they are described in two different pairs of tags. And order 1 is related to John #1, while order 2 is related to John #2. If we were to define a relation between orders that would be based of the fact that the orders are made by same person, order 1 and order 2 would not participate in this relation.
Now, let's make the same in relational model:
There is no way to tell clients apart. We just picked a wrong way to identify a client. We defined the relation between orders and clients to be a relation between order id's and clients' first names.
This way, defining a relation between things requires defining a relation between the monikers or handlers or identifiers of these pieces which should be defined as simple data types: integers or strings or sequences of bytes.
And if there is one thing computers are good at, it's certainly comparing integers and strings.
OK, that's nice. Now, how do we define a relation between items and clients?
We could possibly create a third table that would keep all related pairs just as well as the first two tables do. However, the relation in question can be defined with a simple formula:
an item is related to a customer if an only if exists but a single order the item is related to which is related to a customer.
We can just express this relation in terms of relational algebra.
First, we make a cartesian product of these two relations (which gives us a new relation table):
This way, we just make a combination of all possible relations between
orderId again and
clientName, that may be implied from two relations defined above. This is a quaternary relation that makes no sense in real life, but as I said above, the relational database doesn't care. Mathematically, everything's OK.
Then we should perform a selection: from the list of all possible relations that we just built, choose those relations that satisfy the integer comparison condition between the
orderId's from the right side and from the left side. This is perfectly legal, since
orderId's are integers and the integers can be compared.
Here's what we get after this:
You see that of all possible combinations of records the selection left only those that have matching
orderID's is the right part and in the left part.
We get a new relation between
orderID again and
clientName. This still makes no sense: we need a binary relation, one between
Finally, we do a projection: we remove the excessive elements from the relation, shrinking the duplicates. A table defining a relation should have no duplicates: there is no point in having them. It's enough to list a row once, a computer will get the idea.
Now, we project the quaternary relation between
orderID again and
clientName into a binary relation between
The best part in it that we don't make these relations hardwired and don't need to rewire them manually. We just feed computer with the lists of numbers and strings and tell it to perform the operations on these strings.
In a hierarchical and network models we maintain the relations by modifying the stored elements. To make an order to be related to a client in an XML file, we cut the whole block for that order (including everything inside it) and paste it inside the client tag. We should know the database layout to do this.
Hierarchical model lets you define simple parent-child relations as paths between the pieces of data.
To build more complex relations, you should combine these paths.
We define parent-child relationship between the pieces of information by modifying the physical structure of these pieces: changing file pointers, moving XML tags, etc.
More complex relationships should be defined as the algorithms of moving up and down the hierarchy tree and filtering the information we find while following the paths.
Network model lets you define complex relations as paths between the pieces of data.
Since these relations are more complex than those in hierarchical model, discrepancies can arise between relations defined physically and those existing logically.
It's your duty to maintain the correspondence between them.
In network model, we define relationships between the pieces of information by making the complex paths between them, including but not limited to the parent-child relationships. The queries should be built using the path iterations just as well, but this time, we can make shortcuts (like the bottom arrow on the picture above).
The problem is that when the relations are multiple, these shortcuts are hard to maintain. Say, if we relate an order to another client, logically, all the order's items should be related to that client as well. We need to fix both the order-client relation and item-client shortcut. This makes the updates and schema changes painful.
Relational model replaces relations between entities with relations between integers, strings and another simple types.
You can define most of relations logically, not physically, and the database system can build them from their logical definitions at runtime.
The clients, orders and items are not in the database. They are related to integers and strings which describe these clients, orders and items, but these relations are not the database's concern.
A relational database works with the relationships between these simple types. These relationships can be defined in two ways:
- By mere enumerating the integers and strings that participate in a relation
- By combining relational algebra (Cartesian products, projections and selections) with mathematical relations between integers and strings: equality, inequality, less or, greater then, functions, logarithms — everything that a computer can calculate
This is of course much more powerful.
In hierarchical and network models, we need to optimize pathfinding, and neither computers nor humans are good in it.
In relational model, we need to optimize arithmetics and relational algebra. Computers are extraordinarily fast at arithmetics, and efficient algorithm exist for many relational operators, like a cartesian join combined with a selection on equality of two attributes, or a projection, oe even more complex ones.
This was made possible by replacing relations between entities with relations between integers, strings and another simple types.
Let me repeat this and even enclose it into a red box:
Relational databases do not store relations between entities.
They store relations between integers, stings and another basic data types.
Defining relations between entities and their handles is not the relational database's duty.
This may seem counter-intuitive if not wrong. Everybody knows that relational databases store entities and attributes. How so?
Differences between storage model and semantic model
In a hierarchical model and network model a database can only follow predefined paths which are a part of database's design. They do not distinguish between storage model and semantic model. In these models, how pieces of data are related is how they are stored.
But in a relational database you can relate everything to everything.
Senseless comparisons between
orderID and the item's price? Why not, they both are numbers and comparable. Fetching all clients whose name is longer than the number of orders they made? No problem! Et cetera, et cetera.
The relational database itself, as I repeat, knows nothing about the nature of numbers and strings it stores. It will accept anything you put into it.
You can call the tables that store relations
bar, the attributes
qux and put any combinations of these attributes and run the most extraordinary queries.
This can confuse even an experienced developer. To cope with that, some rules had been invented which somewhat constrain what can be stored in a database so that brings some order into the anarchy of pure relations.
Unlike other models, a relational database distinguishes between storage and semantical models.
Storing, transforming and querying relations is a task for a relational database.
Defining what will be stored is a task for an entity-relationship model.
They are so often used together that they almost became synonyms, still they are completely different things.
In the next article, I will try to cover entity-relationship model more deeply.