This introductory chapter is just meant to provide “the view from 30,000 feet,” as it were. It’s deliberately not deep, and if you do already know something about database management you probably won’t find anything here you don’t already know. But I think it’s worth your while to give the material a “once over lightly” reading anyway, if only to get a sense of what background knowledge I’ll be assuming and relying on in subsequent chapters. Also, the chapter introduces the running example, which you’ll definitely need to be familiar with when we get to those later chapters.
A database can be thought of as a kind of electronic filing cabinet; it contains digitized information (“data”), which is kept in persistent storage of some kind, typically on magnetic disks. Users can insert new information into the database, and delete, change, or retrieve existing information in the database, by issuing requests or commands to the software that manages the database—which is to say, the database management system (DBMS for short). Note: Throughout this book, I take the term user to mean either an application programmer or an interactive user or both, as the context demands.
Now, in practice, those user requests to the DBMS can be formulated in a variety of different ways (e.g., by pointing and clicking with a mouse). For our purposes, however, it’s more convenient to assume they’re expressed in the form of simple text strings in some formal language. Given a human resources database, for example, we might write:
EMP WHERE JOB = 'Programmer'
And this expression represents (let’s agree) a retrieval request—more usually known as a query—for employee information for employees whose job title is ‘Programmer’.
Figure 1-1 below shows sample values for a typical database, having to do with suppliers, parts, and shipments (of parts by suppliers).
As you can see, this database contains three files, or tables. (Actually they’re relations, as we’ll see in Chapter 2, but for the purposes of this introductory chapter “files” or “tables” will do.) The tables are named S, P, and SP, respectively, and since they’re tables they’re made up of rows and columns (in conventional file terms, the rows correspond to records of the file in question and the columns to fields). They’re meant to be understood as follows:
Table S represents suppliers under contract. Each supplier has one supplier number (SNO), unique to that supplier; one name (SNAME), not necessarily unique (though the sample values shown in Figure 1-1 do happen to be unique); one status value (STATUS); and one location (CITY). Note: In the rest of this book I’ll abbreviate “suppliers under contract,” most of the time, to just suppliers.
Table P represents kinds of parts. Each kind of part has one part number (PNO), which is unique; one name (PNAME); one color (COLOR); one weight (WEIGHT); and one location where parts of that kind are stored (CITY). Note: In the rest of this book I’ll abbreviate “kinds of parts,” most of the time, to just parts.
Table SP represents shipments—it shows which parts are shipped, or supplied, by which suppliers. Each shipment has one supplier number (SNO); one part number (PNO); and one quantity (QTY). Also, there’s at most one shipment at any given time for a given supplier and given part, and so the combination of supplier number and part number is unique to any given shipment.
Examples throughout the rest of this book are based for the most part on the foregoing database. Now, you might well have seen this database before—I’ve used it in several other books and writings, including SQL and Relational Theory in particular, as well as in numerous live presentations—and you might be forgiven for getting a little tired of it. But as I’ve written elsewhere, I believe using the same example in a variety of different publications can be a help, not a hindrance, in learning. Of course, it’s true that real databases tend to be much more complicated than this “toy” example; but the trouble with using more realistic examples is that they tend to be too complicated (they make it difficult to see the forest for the trees, as it were). Let me say too that even if it is somewhat unrealistic, the suppliers-and-parts database has at least been very carefully tailored to illustrate all kinds of points that we need to be examining later on; thus, it’s more than adequate for our purposes in this book. Please note, therefore, that in examples throughout the rest of this book, I’ll be assuming the specific sample values shown in Figure 1-1, barring explicit statements to the contrary.
Now, when I say a figure like Figure 1-1 “shows ... a typical database,” what I mean is it shows that database as perceived by the user (what’s sometimes called a logical database). That logical database is contrasted with the corresponding physical database, which is the database as perceived by the DBMS (i.e., it’s the database as physically stored inside the computer system). Note carefully, therefore, that what I’m calling here the logical and physical databases aren’t two totally different things; rather, they’re two different perceptions of the same thing. Refer to Figure 1-2. As that figure is meant to suggest, the DBMS—the software that manages the database—effectively serves as a kind of intermediary between the logical and physical levels of the system: User requests for access to the database are expressed in terms of the logical database, and they’re implemented by the DBMS (“executed”) in terms of the corresponding physical database.
One general function provided by the DBMS is thus the shielding of users from details of the physical level of the system (very much as programming language systems also shield users from details of the physical level of the system). In other words, the DBMS provides users with a perception of the database that’s more abstract, and thus more user friendly, than the way the database looks physically—i.e., the way it’s physically stored inside the system.
Now, a DBMS is a complex piece of software, and it consists of many components. But there’s one component I’d like to mention right away, since it’s so important to the subject of this book, and that’s the optimizer. The optimizer is the DBMS component that’s responsible for deciding exactly how to implement user requests. The point is this: Most requests (indeed, very likely all requests) are capable of being implemented in a variety—typically a very large variety—of different ways. Moreover, those different ways will typically have widely differing performance characteristics; in particular, they could have execution times that vary, quite literally, from fractions of a second to many days. Thus, it’s very important that the optimizer choose a “good” way to implement any given request, where “good” essentially means having good performance.
One immediate and significant implication of the foregoing is this: Assuming the optimizer makes a reasonably good job of things, users shouldn’t have to get involved in performance issues at all. As a matter of fact—to jump ahead of myself for a moment—let me state here for the record that it is and always was a major objective of the relational model that it should be the system, not the user, that has to worry about performance issues. Indeed, to the extent this objective is not met, the system can be said to have failed (or certainly to be less than fully successful, at any rate).
The fact that the logical and physical databases are distinguished and (ideally, at least) kept rigidly apart is what allows us to achieve the important goal of data independence. Data independence—not a very good term, by the way, but we seem to be stuck with it—means we have the freedom to change the way the database is physically stored and accessed without having to make corresponding changes to the way the database is perceived by the user. Now, the reason we might want to change the way the database is physically stored and accessed is almost always performance; and the fact that we can make such changes without having to change the way the database looks to the user means that existing application programs, queries, and the like can all still work after the change. Very importantly, therefore, data independence means protecting existing investment—investment in user training, in existing applications, and in existing database designs (among other things).
To repeat, the DBMS acts as an intermediary between the logical and physical databases; in other words, it supports the user interface to the database. (For the rest of this book, I’ll take the term database, unqualified, to mean a logical database specifically, unless the context demands otherwise—which in fact it almost never will.) Thus, the DBMS is responsible for (a) accepting user requests, be they queries or updates, that are expressed in terms of the logical database and (b) responding to those requests by interpreting and implementing them, or in other words executing them, in terms of the physical database. Note: The term update, in lower case, is used to refer generically to requests that insert new data or delete or change existing data.
So the DBMS “protects users from the data” (i.e., it protects users from the details of how the data is physically represented inside the system). We might also say, somewhat glibly, that it protects the data from users!—by which I mean it provides certain security, concurrency, integrity, and recovery controls. To elaborate briefly:
Security controls are needed to ensure that user requests are legitimate, in the sense that the user in question is requesting an operation he or she is allowed to carry out on data he or she is allowed to access. In the case of the suppliers-and-parts database, for example, some users might not be allowed to see supplier status values; others might not be allowed to see suppliers at all; others might be allowed to see suppliers in London but not in other cities; others might be allowed to retrieve supplier information but not to update it; and so on. In a nutshell, users must be limited to performing only those operations they’re allowed to perform. Note: Security is important, of course, but further details of security controls are mostly beyond the scope of this book (except for a brief mention in Chapter 7).
Concurrency controls have to do with the possibility that several users might be using the database at the same time. Suppose you were to ask the database whether there were any shipments for supplier S1 and, on receiving an answer in the affirmative, went on to ask what the average quantity was for those shipments; it would be very annoying (to say the least) to then be told there weren’t any such shipments after all—presumably because some other user had just deleted them. Concurrency controls are intended to take care of such issues, and I’ll have more to say about them in Part II of this book.
Integrity controls have to do with guaranteeing that the data in the database is correct (insofar as it’s possible to provide any such guarantee). For example, an attempt to insert a shipment for supplier S6 into the shipments table must surely be rejected if there’s no supplier S6 in the suppliers table. Likewise, an attempt to update the status for supplier S1 to 200 must also be rejected, if status values are supposed never to exceed 100. As these examples should be sufficient to illustrate, integrity controls are extremely important, and I’ll have a lot more to say about them in Chapter 3 and Chapter 6 (also in Chapter 13).
Recovery controls have to do with the fact that the database is supposed never to “forget” anything it has been told to “remember.” That is, data, once it has been inserted into the database, should never be deleted again, except as a result of some explicit user request—not even if some failure occurs, such as a system crash or a hard crash on the disk. I’ll have a little more to say about such matters in Part II of this book.
There’s one last thing I need to say regarding DBMSs in general. From everything I’ve said so far, it should be quite clear that there’s a logical difference between a database, which is a repository for data, and a DBMS, which is the software that manages such a repository. Unfortunately, it has become common in the database field—I’m tempted to say, extremely common—to use the term database to mean a DBMS! I do not follow that usage in this book. The problem is, if you call the DBMS a database, what do you call the database?
So now we know what a DBMS is. But what’s a relational DBMS? Well, first, of course, it’s a DBMS, which means it provides all of the functionality mentioned in this chapter so far: data storage, query and update, recovery and concurrency, security and integrity, as well as other functions not discussed in this book. But second, it’s also relational, which means the user interface is based on—better: is a faithful implementation of—the relational model. In other words, the relational model can be thought of as a kind of recipe for what the user interface is supposed to look like in such a DBMS.
Before going any further, I’d like to stress the fact that that recipe (viz., the relational model) is very simple! The prescriptions of the relational model aren’t a straitjacket—rather, they’re a discipline, a discipline that makes life much easier for the user. (Easier for the system too, in certain respects; but the emphasis is on the user.) Now, it’s true that the recipe might sometimes look a little complicated, but that’s because the relational model is, above all, precise, and precision requires precise terminology, and precise terminology can sometimes be a little daunting. But the concepts the terminology refers to are actually quite simple (after all, the suppliers-and-parts database was pretty easy to understand, wasn’t it?). In fact, it seems to me that the concepts in question are much simpler than their counterparts were in older, prerelational (and nonrelational) systems such as IMS and IDMS.
To repeat, then, a relational DBMS is a DBMS that supports a user interface that’s a faithful implementation of the relational model—meaning, as far as the user is concerned, that:
The data looks relational.
Relational operators (i.e., operators that operate on data in relational form) are available to serve as a basis for formulating retrieval and update requests. Here’s a simple example:
S WHERE CITY = 'London'
This expression—which of course represents the query “Get suppliers in London”—is making use of the relational restriction operator; formally, it’s asking for the set of suppliers represented in table S to be restricted to just the ones in London.
In this first part of the book, then, we’ll take a closer look at exactly what it means for data to “look relational,” and we’ll examine various relational operators and see how they can be used in practice. Please note, however, that the treatment isn’t meant to be exhaustive (a more complete treatment of these topics can be found in SQL and Relational Theory); but it is meant to be comprehensive, as far as it goes, and it’s certainly meant to be accurate.
Unfortunately, there’s a problem. In order to illustrate the concepts I’m going to be discussing, I obviously need to show coding examples; thus, I clearly need some formal language in which to express those examples. But the relational model doesn’t prescribe any such language; that is, it doesn’t prescribe a concrete syntax for how its concepts are to be realized in practice. Rather, it’s defined at a high level of abstraction, and is in principle capable of concrete realization in any number of different syntactic forms. Now, a standard concrete language does exist: viz., SQL (pronounced “ess cue ell,” or sometimes “sequel”), which is supported, more or less, by all of the mainstream database products on the market today. As noted in the preface, however, SQL is very deeply flawed: It’s complex, incomplete, hard to learn, and indeed actively misleading in numerous ways. So what I plan to do in this book is this:
First, I’ll explain the relational model without using SQL at all (that’s Part I of the book). In place of SQL, I’ll use a language called Tutorial D—note the boldface—that has been expressly designed for the purpose. Note: I believe Tutorial D is pretty much self-explanatory; however, a comprehensive description can be found if needed in the book Databases, Types, and the Relational Model: The Third Manifesto, by Hugh Darwen and myself (3rd edition, Addison-Wesley, 2007).
Second, I’ll show how ideas from the relational model are realized in concrete form in SQL specifically (that’s Part III of the book). Please be aware, therefore, that I’m definitely not attempting to cover the whole of the SQL language in this book, but only as much as I need—what I called in the preface the core features of the language.
(As fort Part II, a brief explanation of what that’s about can be found in the preface.) Note: Perhaps I should warn you that, precisely because of the foregoing plan, what follows doesn’t look much like most of the books and presentations currently available in the marketplace that purport to be “an introduction to relational databases.”
One last point to close this section: You might or might not know (but I hope you do) that the relational model was originally the invention of E. F. Codd, when he was employed as a researcher at IBM (E for Edgar and F for Frank—but he always signed with his initials; to his friends, among whom I was proud to count myself, he was Ted). It was late in 1968 that Codd, a mathematician by training, first realized that the discipline of mathematics could be used to inject some solid principles and rigor into a field, database management, that prior to that time was all too deficient in any such qualities. His original description of the ideas of the relational model appeared in an IBM Research Report in 1969 (see Appendix E for further discussion).
The point doesn’t often seem to be recognized, but there are some analogies that can helpfully be drawn between database systems (relational or otherwise) and programming systems; in fact, it doesn’t often seem to be recognized that a database system actually is a programming system (more precisely, a special case of such a system). Consider the code fragment shown in Figure 1-3. The purpose of that code fragment—which is expressed in a hypothetical but self-explanatory language—is to compute and display the sum of the integers in a certain one-dimensional array called A.
Note the following points:
Statements: The code overall consists of nine statements. A statement in a programming language is a construct that causes some action to occur, such as defining or updating a variable or changing the flow of control. Observe that there’s a logical difference between a statement and an expression, which is a construct that denotes a value (it can be thought of as a rule for computing, or determining, the value in question). In Figure 1-3, for example, “I := I + 1 ;” is a statement—an assignment statement, as it happens (see further discussion below)—while “I + 1” is an expression. Note: Throughout this book, I adopt the common syntactic convention that statements terminate in a semicolon and expressions don’t.
Types: A type is a named set of values: namely, all legal values of some particular kind. Every value and every variable (see further discussion below) is of some type. The code fragment in Figure 1-3 involves three types: INTEGER (the set of all integers); ARRAY [1..N] OF INTEGER (the set of all one-dimensional arrays of integers with lower bound 1 and upper bound N); and CHAR (the set of all character strings). Note: As we’ll see in a few moments—see the discussion of comparison operators below—it also implicitly involves type BOOLEAN, the set of all truth values. Of course, there are exactly two such values, denoted by the literals TRUE and FALSE.
Variables: A variable is a container for a value (different values at different times, in general). The code fragment in Figure 1-3 involves four variables: I, N, SUM, and A. The “current value” of a given variable—meaning the value the variable contains at some particular time—can be changed in one and only one way: namely, by executing an assignment statement in which the variable in question serves as the target. In fact, to be a variable is to be assignable to, and to be assignable to is to be a variable.
Assignment: Assignment (denoted “:=” in Figure 1-3) is an operator for updating a variable—that is, assigning a value, probably different from the previous value, to the variable in question. The code fragment in Figure 1-3 contains four assignment statements.
Literals: A literal is a “self-defining symbol”—i.e., it’s a symbol that denotes a value, and the value in question, and hence the type of that value also, are fixed and determined by the symbol in question (and are therefore known at compile time, incidentally). The code fragment in Figure 1-3 involves three literals—0, 1, and ‘The sum is’ (the first two of these are of type INTEGER and the third is of type CHAR).
Values: A value is an “individual constant”; it’s what’s denoted by an expression, and it’s what can be assigned to a variable. Note in particular that literals and variable references are both expressions, since they certainly both denote values. Note too that every value effectively carries its type around with it.
Read-only operators: A read-only operator is an operator like “+” that “derives new values from old ones”; for example, the expression 2 + 3 derives the “new” value 5 from the “old” values 2 and 3. Note carefully that a read-only operator, when it’s invoked, returns a result but doesn’t update anything (in particular, it doesn’t update its operands). Note too that an expression, which I said earlier was a rule for computing a value, can equivalently be said to represent an invocation of some read-only operator; in fact, the terms expression and read-only operator invocation are effectively interchangeable. Note finally that read-only operators don’t have to be denoted by a special symbol such as “+”; in fact, the vast majority of such operators are denoted by means of some more conventional identifier. For example, many programming languages support a read-only operator called RANDOM (or some such) for generating pseudorandom numbers.
Comparison operators: A comparison operator is an operator like “<” that, when it’s invoked, returns a truth value (either TRUE or FALSE). In fact, of course, such operators are in fact read-only operators; they’re just a special case, in which the type of the result they return is BOOLEAN.
Finally, the reason for rehearsing all of this extremely familiar material is that (as I hope you were expecting) all of the foregoing concepts are directly relevant to databases, as we’ll see in the chapters to come.
I need to say a little more about the concept of types in particular. The code fragment in Figure 1-3 didn’t illustrate the point, but types in general can be either system defined (i.e., built in) or user defined. They can also be arbitrarily complex. For example, in a geometric application, we might have user defined types called POINT, LINE, RECTANGLE, CIRCLE, and so on. However, to the user who merely uses them (as opposed to the user who actually defines them), these user defined types will look exactly like system defined types anyway—indeed, that’s the whole point (or a large part of the point, at any rate). So little or no generality is lost if, in my examples later in this book, I limit myself to system defined types such as INTEGER and CHAR (see below), and so I will, most if not quite all of the time.
Next, I’ve said that every value is of some type. It follows that every variable, every parameter to every operator, every read-only operator, and every expression—in particular, every literal and every variable reference—is of some type as well, because all of these constructs, when they’re used, certainly do denote values. To be specific:
In the case of variables, parameters, and read-only operators, the type in question is specified when the construct in question is defined. For example, see the variable definitions—i.e., the VAR statements—in Figure 1-3.
In the case of expressions, the type in question is simply the type of the result returned when the expression in question is evaluated. For example, the expression 2 + 3 is of type INTEGER, because the result 5 of evaluating that expression is of type INTEGER.
Third and last, it’s important to understand that associated with any given type T, there’s a set of operators defined for operating on values and variables of type T (because types without operators are useless). For example, in the case of type INTEGER, which for simplicity I take to be system defined, the agency responsible for defining the type—in other words, the system, by my assumption—must define:
Operators “:=”, “=”, “<”, etc., for assigning and comparing integers
Operators “+”, “*”, etc., for performing arithmetic on integers
Perhaps a CAST operator, for converting integers to character strings
It’s also important to understand that the set of operators associated with any given type T must include both assignment (“:=”) and equality comparison (“=”). What’s more, the semantics of these operators must be such that the following requirements are satisfied:
Assignment: After assignment of value v to variable V, the comparison V = v must give TRUE. Note: This requirement is sometimes referred to as The Assignment Principle.
Equality: The comparison v1 = v2 must give TRUE if and only if v1 and v2 are the very same value (implying, incidentally, that they must certainly be of the same type). Note the following important corollary: If there exists some operator Op such that Op(v1) ≠ Op(v2), then v1 = v2 must give FALSE.
Note: Of these two operators (“:=” and “=”), equality in particular is absolutely fundamental—for without it, we couldn’t even tell, given some value v and some set of values S, whether v appears in S (i.e., whether v is an element of S).
Finally, let me say again that the reason for rehearsing all of this extremely familiar material is that once more (as I hope you were expecting) all of these concepts are directly relevant to databases, as we’ll see in the chapters to come.
Now it’s your turn. Of course, it isn’t possible to set any particularly searching exercises at this early point in the book, and the following are little more than review questions. Nevertheless, I’d like to recommend that you try to answer them yourself before going on to read my answers in the next section. Note: The first two exercises in particular are slightly unfair, because I haven’t yet told you enough to answer them properly—but I think you should have a go at them anyway. They’re not very difficult.
1.1 a. Parts with weight less than 12.5. This example involves a relational restriction operation. b. Part number, color, and city for every part. This part of the exercise was definitely unfair, but you probably guessed the answer anyway—just as restriction picks out certain rows, so the operation of projection, illustrated by this example, picks out certain columns. See Chapter 4 for further explanation.
1.2 a. Delete all suppliers with status 10. b. Increase the status by 5 for all suppliers with status greater than 10. By the way, notice the use of the keyword UPDATE (in upper case) in this exercise. Perhaps a little confusingly, it has become standard practice in the database world to use the uppercase term UPDATE to refer to the specific operator that changes existing data (as opposed to DELETE, which deletes existing data, and INSERT, which inserts new data), and the lowercase term update to refer to the INSERT, DELETE, and UPDATE operators considered generically. In this book, therefore, when I want to refer to the UPDATE operator as such, I’ll set it in upper case (“all caps”) as just shown. As for the INSERT and DELETE operators, however, where no ambiguity arises, it can be a little tedious always to set them in all caps—especially when they’re being used as qualifiers, as in, e.g., “INSERT statement” (“insert statement”?). I’ve therefore decided to use both forms in this book, letting context be my guide in any given situation (and I won’t pretend I’ve been all that consistent in this regard, either).
1.3 A database is a repository for data stored electronically (an “electronic filing cabinet”). A DBMS is a software system that manages databases and access to those databases.
1.4 Security controls protect the database from unauthorized operations. Integrity controls protect the database from authorized but invalid operations. Concurrency controls protect user operations from interfering with one another. Recovery controls protect against data loss.
1.5 SQL is the standard language for interacting with “relational” databases; it’s supported by just about every mainstream database product on the market today. Tutorial D is a language expressly designed for use in illustrating relational concepts; prototype implementations do exist—see the website www.thethirdmanifesto.com—but, at the time of writing, no commercial products.
1.6 A type is a named set of values. A value is an “individual constant”—e.g., the integer three—and it can’t be changed. A variable is a holder for (some encoded representation of) some value, and it can be changed—i.e., its current value can be replaced by another value. A literal is a “self-defining symbol” denoting a value—e.g., the numeral 3. Assignment is the operator by which a variable is updated. A comparison is a read-only operator that (usually) takes two values as operands and returns a truth value. A read-only operator is an operator that takes zero or more values as operands and returns a value. An update operator is an operator that updates some variable; in other words, it’s an assignment, logically speaking. Note: The operators INSERT, DELETE, and UPDATE mentioned in the answer to Q: are all relational update operators, and are thus by definition all effectively relational assignments. For further explanation of this point, see Chapter 2.
1.7 See Figure 1-1.
 But still a computer professional, not a genuine “end user,” who might reasonably be quite ignorant of most of the matters to be discussed in this book.
 I remind you from the preface that throughout this book I use “SQL and Relational Theory” as an abbreviated form of reference to my book SQL and Relational Theory: How to Write Accurate SQL Code (2nd edition, O’Reilly, 2012).
 By the way, that physical database, although it’s less abstract than the logical database, is still an abstraction; it consists, typically, of things like stored files and indexes, and those stored files and indexes are represented in terms of various lower level constructs such as pages and disk spaces. And these latter constructs in turn are abstractions of still lower level constructs such as bits and bytes—which are, of course, themselves abstractions as well (and so on, as deep as you care to go).
 A typical example, found almost at random on the Internet: “MySQL [is] the world’s most popular open source database.” No, it isn’t. It might be the world’s most popular open source DBMS—I wouldn’t know—but a database it’s most certainly not.
 Much simpler too than their counterparts in the various proposals—which crop up all too frequently, I’m sorry to say—for replacing the relational model (see, e.g., XML, NoSQL, the Actor Model, and on and on). Incidentally, I’ve never seen such a proposal in which the person doing the proposing really understood the relational model. Surely, if you want to claim that Technology A is no good and needs to be replaced by Technology B, then it’s incumbent on you to understand Technology A in the first place? And, more particularly, to demonstrate how Technology B solves some specific problem that Technology A doesn’t?
 Tutorial D has been revised and extended somewhat since that book was first published. A description of the revised version (which is essentially the version I’ll be using in this book) can be found in another book by Hugh Darwen and myself, Database Explorations: Essays on The Third Manifesto and Related Topics (Trafford, 2010). See also www.thethirdmanifesto.com.
 In fact, of exactly one type, unless type inheritance is a consideration, which in this book it isn’t. Note: It follows that types are disjoint (at least as far as we’re concerned)—no value is of two or more types.
 The obvious question arises: Is there any such thing as an operator that isn’t read-only? The answer is yes, there are update operators, which are operators that, when invoked, don’t return a result but do update some variable. As we’ll see later, however, any given update operator invocation is functionally equivalent to some assignment operation. It follows that, logically speaking, assignment is the only update operator we need.