C’est le premier pas qui, dans toutes les guerres, décèle le génie.
It is the first step that reveals genius in all wars.
—Joseph de Maistre (1754-1821) Lettre du 27 Juillet 1812 à Monsieur le Comte de Front
The great nineteenth century German strategist, Clausewitz, famously remarked that war is the continuation of politics by other means. Likewise, any computer program is, in one way or another, the continuation of the general activity within an organization, allowing it to do more, faster, better, or cheaper. The main purpose of a computer program is not simply to extract data from a database and then to process it, but to extract and process data for some particular goal. The means are not the end.
A reminder that the goal of a given computer program is first of all to meet some business requirement [*] may come across as a platitude. In practice, the excitement of technological challenges often slowly causes attention to drift from the end to the means, from upholding the quality of the data that records business activity to writing programs that perform as intended and in an acceptable amount of time. Like a general in command of his army at the beginning of a campaign, we must know clearly what our objectives are—and we must stick to them, even if unexpected difficulties or opportunities make us alter the original plan. Whenever the SQL language is involved, we are fighting to keep a faithful and consistent record of business activity over time. Both faithfulness and consistency are primarily associated with the quality of the database model. The database model that SQL was initially designed to support is the relational model . One cannot overemphasize the importance of having a good model and a proper database design, because this is the very foundation of any information system.
A database is nothing but a model of a small part of a real-life situation. As any representation, a database is always an imperfect model, and a very narrow depiction of a rich and complex reality. There is rarely a single way to represent some business activity, but rather several variants that in a technical sense will be semantically correct. However, for a given set of processes to apply, there is usually one representation that best meets the business requirement.
The relational model is thus named, not because you can relate tables to one another (a popular misconception), but as a reference to the relationships between the columns in a table. These are the relationships that give the model its name; in other words, relational means that if several values belong to the same row in a table, they are related. The way columns are related to each other defines a relation, and a relation is a table (more exactly, a table represents one relation).
The business requirements determine the scope of the real-world situation that is to be modeled. Once you have defined the scope, you can proceed to identify the data that you need to properly record business activity. If we say that you are a used car dealer and want to model the cars you have for sale (for instance to advertise them on a web site), items such as make, model, version, style (sedan, coupe, convertible...), year, mileage, and price may be the very first pieces of information that come to mind. But potential buyers may want to learn about many more characteristics to be able to make an informed choice before settling for one particular car. For instance:
General state of the vehicle (even if we don’t expect anything but “excellent”)
Manual or automatic transmission
Color (body and interiors), metallic paintwork or not, upholstery, hard or soft top, perhaps a picture of the car
Seating capacity, trunk capacity, number of doors
Power steering, air conditioning, audio equipment
Engine capacity, cylinders, horsepower and top speed, brakes (everyone isn’t a car enthusiast who would know technical specifications from the car description)
Fuel, consumption, tank capacity
Current location of the car (may matter to buyers if the site lists cars available from a number of physical places)
And so on.. .
If we decide to model the available cars into a database, then each row in a table summarizes a particular statement of fact—for instance, that there is for sale a 1964 pink Cadillac Coupe DeVille that has already been driven twenty times around the Earth.
Through relational operations, such as joins, and also by filtering, selection of particular attributes, or computations applied to attributes (say computing from consumption and tank capacity how many miles we can drive without refueling), we can derive new factual statements. If the original statements are true, the derived statements will be true.
Whenever we are dealing with knowledge, we start with facts that we accept as truths that need no proof (in mathematics these are known as axioms , but this argument is by no means restricted to mathematics and you could call those unproved true facts principles in other disciplines). It is possible to build upon these true facts (proving theorems in mathematics) to derive new truths. These truths themselves may form the foundations from which further new truths emerge.
Relational databases work in exactly the same way. It is absolutely no accident that the relational model is mathematically based. The relations we define (which once again means, for an SQL database, the tables we create) represent facts that we accept, a priori, as true. The views we define, and the queries we write, are new truths that we prove.
The coherence of the relational model is a critically important concept to grasp. Because of the inherent mathematical stability of the principles that underlie relational data modeling , we can be totally confident that the result of any query of our original database will indeed generate equally valid facts—if we respect the relational principles. Some of the key principles of the relational theory are that a relation, by definition, contains no duplicate, and that row ordering isn’t significant. As you shall see in Chapter 4, SQL allows developers to take a number of liberties with the relational theory, liberties that may be the reasons for either surprising results or the failure of a database optimizer to perform efficiently.
There is, however, considerable freedom in the choice of our basic truths. Sometimes the exercise of this freedom can be done very badly. For example, wouldn’t it be a little tedious if every time someone went to buy some apples, the grocer felt compelled to prove all Newtonian physics before weighing them? What must be thought of a program where the most basic operation requires a 25-way join?
We may use much data in common with our suppliers and customers. However, it is likely that, if we are not direct competitors, our view of the same data will be different, reflecting our particular perspective on our real-life situation. For example, our business requirements will differ from those of our suppliers and customers, even though we are all using the same data. One size doesn’t fit all. A good design is a design that doesn’t require crazy queries.
Normalization, and especially that which progresses to the third normal form (3NF), is a part of relational theory that most students in computer science have been told about. It is like so many things learned at school (classical literature springs to mind), often remembered as dusty, boring, and totally disconnected from today’s reality. Many years later, it is rediscovered with fresh eyes and in light of experience, with an understanding that the essence of both principles and classicism is timelessness.
The principle of normalization is the application of logical rigor to the assemblage of items of data—which may then become structured information. This rigor is expressed in the definition of various normal forms, most typically three, although purists argue that one should analyze data beyond 3NF to what is known in the trade as Boyce-Codd normal form (BCNF), or even to fifth normal form (5NF). Don’t panic. We will discuss only the first three forms. In the vast majority of cases, a database modeled in 3NF will also be in BCNF[*] and 5NF.
You may wonder why normalization matters. Normalization is applying order to chaos. After the battle, mistakes may appear obvious, and successful moves sometimes look like nothing other than common sense. Likewise, after normalization the structures of the various tables in the database may look natural, and the normalization rules are sometimes dismissively considered as glorified common sense. We all want to believe we have an ample supply of common sense; but it’s easy to get confused when dealing with complex data. The three first normal forms are based on the application of strict logic and are a useful sanity checklist.
The odds that our creating un-normalized tables will increase our risk of being struck by divine lightning and reduced to a little mound of ashes are indeed very low (or so I believe; it’s an untested theory). Data inconsistency, the difficulty of coding data-entry controls, and error management in what become bloated application programs are real risks, as well as poor performance and the inability to make the model evolve. These risks have a very high probability of occurring if we don’t adhere to normal form, and I will soon show why.
How is data moved from a heterogeneous collection of unstructured bits of information into a usable data model? The method itself isn’t complicated. We must follow a few steps, which are illustrated with examples in the following subsections.
First of all, we must ensure that the characteristics, or attributes, we are dealing with are atomic. The whole idea of atomicity is rather elusive, in spite of its apparent simplicity. The word atom comes from ideas first advanced by Leucippus, a Greek philosopher who lived in the fifth century B.C., and means “that cannot be split.” (Atomic fission is a contradiction in terms.) Deciding whether data can be considered atomic or not is chiefly a question of scale. For example, a regiment may be an atomic fighting unit to a general-in-chief, but it will be very far from atomic to the colonel in command of that regiment, who deals at the more granular level of battalions or squadrons. In the same way, a car may be an atomic item of information to a car dealer, but to a garage mechanic, it is very far from atomic and consists of a whole host of further components that form the mechanic’s perception of atomic data items.
From a purely practical point of view, we shall define an
atomic attribute as an attribute that, in a
where clause, can always be
referred to in full. You can split and chop an attribute as much as
you want in the select list (where it is returned); but if you need to
refer to parts of the attribute inside the
where clause, the attribute lacks the level
of atomicity you need. Let me give an example. In the previous list of
attributes for used cars, you’ll find “safety equipment,” which is a
generic name for several pieces of information, such as the presence
of an antilock braking system (ABS), or airbags (passenger-only,
passenger and driver, frontal, lateral, and so on), or possibly other
features, such as the centralized locking of doors. We can, of course,
define a column named
safety_equipment that is just a description
of available safety features. But we must be aware that by using a
description we forfeit at least two major benefits:
If some users consider ABS critical because they often
drive on wet, slippery roads, a search that specifies “ABS” as
the main criterion will be very slow if we must search column
safety_equipment in every row
for the “ABS” substring. As I’ll show in Chapter 3, regular indexes
require atomic (in the sense just defined) values as keys. One
can sometimes use query accelerators other than regular indexes
(full-text indexing, for instance), but such accelerators
usually have drawbacks, such as not being maintained in real
time. Also take note that full-text search may produce awkward
results at times. Let’s take the example of a
color column that contains a
description of both body and interior colors. If you search for
“blue” because you’d prefer to buy a blue car, gray cars with a
blue interior will also be returned. We have all experienced
irrelevant full-text search results through web searches.
Data-entry is prone to error. More importantly than
dissuasive search times, if “ASB” is entered instead of “ABS”
into a descriptive string, the database management system will
have no way to check whether the string “ASB” is meaningful. As
a result, the row will never be returned when a user specifies
“ABS” in a search, whether as the main or as a secondary
criterion. In other words, some of our queries will return wrong
results (either incomplete, or even plain wrong if we want to
count how many cars feature ABS). If we want to ensure data
correctness, our only means (other than double-checking what we
have typed) is to write some complicated function to parse and
analyze the safety equipment string when it is entered or
updated. It is hard to decide what will be worse: the hell that
the maintenance of such a function would be, or the performance
penalty that it will inflict on loads. By contrast, a mandatory
has_ABS column would not
guarantee that the information is correct, but at least
declarative check constraints can make the DBMS reject any value
other than Y or N.
Partially updating a complex string of data requires first-rate mastery of string functions. Thus, you want to avoid cramming multiple values into a single string.
Defining data atoms isn’t always a simple exercise. For example, the handling of addresses frequently raises difficult questions about atomicity. Must we consider the address as some big, opaque string? Or must we break it into its components? And if we decompose the address, to what level should we split it up? Remember the points made earlier about atomicity and business requirements. How we represent an address actually depends on what we want to do with the address. For example, if we want to compute statistics or search by postal code and town, then it is desirable to break the address up into sufficient attribute components to uniquely identify those important data items. The question then arises as to how far this decomposition of the address should be taken.
The guiding principle in determining the extent to which an address should be broken into components is to test each component against the business requirements, and from those requirements derive the atomic address attributes. What these various address attributes will be cannot be predicted (although the variation is not great), but we must be aware of the danger of adopting an address format just because some other organization may have chosen it, before we have tested it critically against our own business needs.
Note that sometimes, the devil is in the details. By trying to be too precise, we may open the door to many distracting and potentially irrelevant issues. If we settle for a level of detail that includes building number and street as atomic items, what of ACME Corp, the address of which is simply “ACME Building”? We should not create design problems for information we don’t need to process. Properly defining the level of information that is needed can be particularly important when transferring data from an operational to a decision-support system.
Once all atomic data items have been identified, and their mutual interrelationships resolved, distinct relations emerge. The next step is to identify what uniquely characterizes a row—the primary key. At this stage, it is very likely that this key will be a compound one, consisting of two or more individual attributes. To go on with our used car example, for a customer it’s the combination of make, model, version, style, year, and mileage that will identify a particular vehicle—not the current registration number. It isn’t always easy to correctly define a key. A good, classic example of attribute analysis is the business definition of “customer.” A customer may be identified by a name. However, a name may not be the best identifier. If our customers are companies, the way we identify them may be the source of ambiguities—is it “RSI,” “Relational Software,” “Relational Software Inc” (with or without a dot following “Inc,” with or without a comma after “Relational Software”) that identifies this given company? Uppercase? Lowercase? Capitalized initials? We have here all the conditions for storing information inside a database and never seeing it again. The choice of the customer name as identifier is a challenging one, because it demands the strict application of naming standards to avoid possible ambiguities. It may be preferable to identify a customer on the basis of either a standard short name, or possibly by use of a unique code. And one should always keep in mind the impact on related data of Relational Software Inc. changing its name to, say, Oracle Corporation. If we need to keep a history of our relationship, then we must be able to identify both names as representing the same company at different points in time.
As a general rule, you should, whenever possible, use a unique
identifier that has meaning rather than some obscure sequential
integer. I must stress that the primary key is what characterizes the
data—which is not the case with some sequential identifier associated
with each new row. You may choose to add such an identifier later, for
instance because you find your own
company_id easier to handle than the place
of incorporation and registration number that truly identify a
company. You can even promote the sequential identifier to the envied
status of primary key, as a technical substitute (or shorthand) for
the true key, in exactly the same way that you’d use table aliases in
a query in order to be able to write:
where a.id = b.id
where table_with_a_long_name.id = table_even_worse_than_the_other.id
But a technical, numerical identifier doesn’t constitute a real primary key by the mere virtue of its existence and mustn’t be mistaken for the real thing. Once all the attributes are atomic and keys are identified, our data is in first normal form (1NF).
I have pointed out that some of the information that we
should store to help used car buyers make an informed choice would
already be known by a car enthusiast. In fact, many used car
characteristics are not specific to one particular car. For example,
all the cars sharing make, model, version, and style will have the
same seating and cargo capacity, regardless of year and mileage. In
other words, we have attributes that depend on only a part of the key.
What are the implications of keeping them inside a
If we happen to have for sale many cars of the same make, model, version, and style (a set of characteristics that we can generically call the car model), all the attributes that are not specific to one particular car will be stored as many times as we have cars of the same model. There are two issues with the storage of redundant data . First, redundant data increases the odds of encountering contradictory information because of input errors (and it makes correction more time-consuming). Second, redundant data is an obvious storage waste. It is customary to hear that nowadays storage is so cheap that one no longer needs to be obsessed with space. True enough, except that such an argument overlooks the fact that there is also more and more data to store in today’s world. It also overlooks the fact that data is often mirrored, possibly backed up to other disks on a disaster recovery site where it is mirrored again, and that many development databases are mere copies of production databases. As a result, every wasted byte isn’t wasted once, but four or five times in the very best of cases. When you add up all the wasted bytes, you sometimes get surprisingly high figures. Besides the mere cost of storage, sometimes—more importantly—there is also the issue of recovery. There are cases when one experiences “unplanned downtime,” a very severe crash for which the only solution is to restore the database from a backup. All other things being equal, a database that is twice as big as necessary will take twice the time to restore than would otherwise be needed. There are environments in which a long time to restore can cost a lot of money. In a hospital, it can even cost lives.
A table that contains a lot of information (with a large
number of columns) takes much longer to scan than a table with a
reduced set of columns. As we shall see in other chapters, a
full table scan is not necessarily the scary situation that many
beginners believe it to be; there are many cases where it is by
far the best solution. However, the more bytes in the average
row, the more pages will be required to store the table, and the
longer it takes to scan the table. If you want to display a
selectable list of the available car models, an un-normalized
table will require a
distinct applied to all the available cars. Running a
select distinct doesn’t mean
only scanning many more rows than we would with a separate
car_model table, but it also
means having to sort those rows to eliminate duplicates. If the
data is split in such a way that the DBMS engine can operate
against only a subset of the data to resolve the query,
performance will be significantly better than when it operates
against the whole.
To remove dependencies on a part of the key, we must create
tables (such as
keys of those new tables will each be a part of the key for our
original table (in our example, make, model, version, and style). Then
we must move all the attributes that depend on those new keys to the
new tables, and retain only make, model, version, and style in the
original table. We may have to repeat this process, since the engine
and its characteristics will not depend on the style. Once we have
completed the removal of attributes that depend on only a part of the
key, our tables are in second normal form
When all data has been correctly moved into 2NF, we can commence the process of identifying the third normal form (3NF). Very often, a data set in 2NF will already be in 3NF, but nevertheless, we should check the 2NF set. We now know that each attribute in the current set is fully dependent on the unique key. 3NF is reached when we cannot infer the value of an attribute from any attribute other than those in the unique key. For example, the question must be asked: “Given the value of attribute A, can the value of attribute B be determined?”
International contact information provides an excellent example
of when you can have an attribute dependent on another non-key
attribute: if you know the country, you need not record the
international dialing code with the phone number (the reverse is not
true, since the United States and Canada share the same code). If you
need both bits of information, you ought to associate each contact
with, say, an ISO country code (for instance IT for Italy), and have a
country_info table that
uses the country code as primary key and that holds useful country
information that your business requires. For instance, a
country_info table may record that the
international dialing code for Italy is 39, but also that the Italian
currency is the euro, and so on. Every pair of attributes in our 2NF
data set should be examined in turn to check whether one depends on
the other. Such checking is a slow process, but essential if the data
is to be truly modeled in 3NF. What are the risks associated with not
having the data modeled in 3NF? Basically you have the same risks as
from not respecting 2NF.
There are various reasons that modeling to the third normal form is important. (Note that there are cases in which designers deliberately choose not to model in third normal form; dimensional modeling, which will be briefly introduced in Chapter 10, is such a case. But before you stray from the rule, you must know the rule and weigh the risks involved.) Here are some reasons:
As Chapter 10 will show, a non-normalized model such as the dimensional one finds its justification in assumptions about how the data is maintained and queried (the same can be said of the physical data structures that you’ll see in Chapter 5; but a physical implementation change will not jeopardize the logic of programs, even if it can seriously impact their performance). If the assumptions prove wrong one day, all you can do is throw everything away and rebuild from scratch. By contrast, a 3NF model may require some query adjustments, but it will be flexible enough to accommodate changes.
As I have already pointed out, duplicate data is costly, both in terms of disk space and processing power, but it also introduces a much-increased possibility of data becoming corrupt. Corruption happens when one instance of a data value is modified, but the same data held in another part of the database fails to be simultaneously (and identically) modified. Losing information doesn’t only mean data erasure: if one part of the database says “white” while another part says “black,” you have lost information. Data inconsistency can be prevented by the DBMS if the modeling allows it—if your atomic attributes let you define column constraints, or if you can declare referential integrity constraints. Otherwise, it has to be prevented by additional programming traps. You then have the choice between using triggers and stored procedures that can grow very complex and add significant overhead, or making programs unnecessarily complicated and therefore costlier to maintain. Triggers and stored procedures must be extremely well documented. Data consistency ensured in programs moves the protection of data integrity out of the database and into the application layer. Any other program that needs to access the same data has the choice between duplicating the data integrity protection effort, or happily corrupting the data painfully maintained in a consistent state by other programs.
A very common modeling mistake is to associate large numbers of possible characteristics within a relation, which may result in a table with a large number of columns. Some scientific disciplines may require a very detailed characterization of objects under study, and thus require a large number of attributes, but this is rarely the case in business applications. In any case, a sure sign that a database design is flawed is when columns of some prominent tables mostly contain null values , and especially when two columns cannot possibly contain a value at the same time; if one is defined, the other must be null, and vice versa. This condition would undoubtedly indicate a violation of either 2NF or 3NF.
If we admit that a row in a table represents a statement about the characteristics of a given “thing,” indicating that “we don’t know” for most characteristics seriously downgrades the table as a source of reliable information. This may be a minor inconvenience if the data is stored for informative purpose only. It becomes a major issue if the unknown values are supposed to help us define a result set, and this state of affairs is indicative of a flawed model. All columns in a row should ultimately contain a value, even if business processes are such that various pieces of information are entered from more than one source and/or at different points in time. A stamp collector might likewise keep some room in an album for a series temporarily absent from the collection. But even so, there is a risk of wasting storage if it is actually reserved because one always tailors for the maximum size. There is also a risk of very serious performance problems if only placeholders are used and data goes to some remote overflow area when it is entered at last.
The existence of null values also raises an important point with
regard to relational modeling, which is the main foundation for the
query optimizer. The completeness of a relational model is founded on
the application of two-valued logic ; in which things are or they
aren’t. Any in-between case, a null value, is
indeterminate; but in a
conditions cannot be indeterminate. They are true or they are false,
because you return a row or you don’t; you cannot return a row with a
“maybe this one answers the question but I’m not really sure” qualifier.
The transition from the three-valued logic implied
by nulls (true, false, or indeterminate) to the two-valued logic of the
result set is perilous. This is why all SQL practitioners can recall
cases when what looked like a good SQL query failed to return the proper
result set because of an encounter with null values. For instance, if a
color contains the
BLACK, this condition:
where color not in ('BLUE', 'BLACK', null)
will result in no row being returned, because we don’t know what
null is and the SQL engine will
consider that there is a possibility that it might be
where color in ('BLUE', 'BLACK', null)
will return all rows for which
BLACK, and nothing else (remember, we have no
BLUE in our table), since there is a
null would be
GREEN. As you can see, an SQL engine is even
more risk-averse than a banker. Finding an explicit
null inside an
) list is, of course, unusual; but such a situation may occur
if, instead of an explicit list, we have a subquery and fail to ensure
that no null value is returned by that subquery.
A representation of customers can provide a very good example of the difficulties inherent to dealing with missing information. Each customer has an address, which is normally the address that will appear on an invoice to that customer. But what if the address to which we must ship our goods is different? Must we consider the shipping address to be a characteristic of the order? It can make sense if we sell once, only to never see customers again. If we are not a funeral parlor, however, and especially if we repeatedly ship goods to the same address, it makes no sense at all from a business point of view. Entering the same data over and over again, besides being a waste of time, also increases the risk of a mistake—hence goods get sent to the wrong address, creating a dissatisfied customer or perhaps an ex-customer. The shipping address is, obviously, a characteristic of the customer, not of the order. This situation ought to have been resolved in the analysis of dependencies during the original design of the model.
It is also possible to have the accounting department at a
location different from the official, customer delivery address if the
customer is a company. So, for one customer, we may have one “official”
address, a billing address, and also a shipping address. It is quite
common to see
customer tables with
three sets of columns (each set describing one address) for this
However, if we can have all these addresses, what is likely to be the most common case? Well, it is quite possible that in 90% of the cases we shall have only one useful address, the official address. So, what must we do with all our other columns? Two possibilities come to mind:
This is not a very sound strategy, because this will require our programs to use implicit rules , such as “if the billing address is undefined, then send the invoice to the corporate address.” The logic of such programs will become much more complicated, with an increased risk of bugs entering the code.
This approach will require special processing during data entry, by a trigger perhaps. In such a case the overhead may not matter much, but in another case the overhead might matter a lot. Moreover, we must also take care of replicating changes--each update of the corporate address must be replicated to those of the other addresses that are identical, for fear of inconsistency.
Both of these scenarios betray a critical lack of understanding on the part of the original modelers. Using null values and implicit rules is a classic fudge to accommodate three-valued logic. The use of nulls inevitably introduces three-valued logic, which immediately introduces semantic inconsistency ; no amount of clever programming can remove semantic issues. Replicating data illustrates what happens when dependencies have not been properly analyzed.
One solution to our address conundrum might be to get the address
information out of the
table. One design we may contemplate is to store each address in an
address table, together with a
customer identifier and some column (a bit mask, perhaps) indicating the
role of the address. But this is not necessarily
the best solution, because issues such as the true meaning of addresses
often appear after programs have been rushed into production and an
attempt to remodel the original data as part of a later release can
introduce insuperable problems.
We have so far assumed that we have one
shipping address for each customer, which may or may not be identical to
the corporate, registered address. What if we send our invoices to a
single place but must ship our goods to many different branches, with
several distinct shipments belonging to the same invoice? This is not
necessarily unusual! It is no longer workable for our design to have a
single (mostly null) “shipping address” (represented by several columns)
customer table. We are,
ironically, back to the “shipping address is a characteristic of the
order” situation. This means that if we want to refer (especially
repeatedly) to addresses in orders, we must associate some kind of
purpose-built identifier to our addresses, which will spare us repeating
the whole shipping address in each order (normalization in action). Or
perhaps we should begin to contemplate the introduction of a
There is no such thing as the totally perfect design for the customers/addresses conundrum. I have just wandered through likely problems and tried to sketch some of the possible solutions. But there will be one solution that works best in your case, and many other solutions that will lead to the risks of inconsistencies. With an inappropriate solution, code will be at best more complicated than necessary with very high odds of being underperforming as well.
The question of null values is probably the thorniest issue of the relational theory. Dr. E.F. Codd, the father of the relational model, introduced null values early, and explicitly asked in the 3rd of the 12 rules that he published in 1985 for a systematic treatment of null values. (The 12 rules were a concise definition of the required properties of a relational database.) However, the battle is still raging among theorists. The problem is that “not known” may encompass quite a number of different cases. Let’s consider a list of famous writers, each with a birth date and a death date. A null birth date would unambiguously mean “unknown.” But what does a null death date mean? Alive? We don’t know when this author died? We don’t know whether this author is alive or not?
I cannot resist the pleasure of quoting the immortal words of the then-U.S. Secretary of Defense, Mr. Donald Rumsfeld, at a February 2002 news briefing of his department:
As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.
I don’t find it unusual to have null values for, to put it in Rumsfeldese, “known unknowns,” attributes that are known to exist and have some value we don’t know at one point in time, for various reasons. For the rest, speculating leads nowhere. Strangely, some of the most interesting usages of null values may perfectly involve nothing but tables where all columns of all rows contain values: null values can be generated through outer joins. Some efficient techniques for checking the absence of particular values that I discuss in Chapter 6 are precisely based on outer joins and tests on null values.
Even though the Boolean type doesn’t exist in SQL, many
people feel a need to implement flags to indicate a Boolean true/false
status (for instance
order_completed). You should aim for
increasing the density of your data--
order_completed may be useful information to
know, but then perhaps other information would be nice to store too:
when was it completed? Who completed it? So that means that instead of
having a single “Y/N” column, we can have a
completion_date column, and perhaps a
completed_by column, both of which will tell
us more (although we may not necessarily want to see a null value as
long as the order isn’t completed; a solution may be to use a distinct
table to track the various stages of every order from creation to
completion). As before, examine the dependencies in the context of your
business requirements, and only include those additional columns where
the successful operation of the business requires it.
Alternatively, a series of essentially Boolean attributes can
sometimes be advantageously combined into a unique
status attribute. For instance, if you have
four attributes that can be either true or false, you can assign a
numerical value between 0 and 15 to each of the possible combinations
and define the “status” as being represented by this value. But
beware—this technique may offend the basic rule of atomicity, so if you
must use this approach, do so with considerable caution.
Another reason for the appearance of unnecessarily wide tables (as in having too many attributes) is a lack of understanding of the true relationship between data items. Consider the example of subtypes . A company may have a mix of employees, some of whom are permanent, others who are contractors. They all have several properties in common (name, year of birth, department, room, phone number, and so forth), but there are also properties that are unique to each type of employee (for instance, hire date and salary for permanent employees, rate and contract reference for contractors). The manner in which the common attributes can be shared, while ensuring that the distinctive features are kept separate, introduces the topic of subtypes.
We can model this situation by defining three tables. First, the
employee table contains all
information that is common to every employee, regardless of their
status. However, an attribute tells the status of each employee. It has
as many distinct values as there are distinct employee types, for
example “P” (for permanent employee), and “C” (for contract employee).
This table uses an employee number as the primary key.
Next, we create additional tables, one for each employee type. In
this case, there are two tables. Tables
contract represent subtypes of the table
employee, for example. Each permanent
or contract employee inherits certain characteristics from the
employee table, in addition to possessing
unique characteristics, as defined in their own tables.
Now let’s examine the creation of the primary keys between these
two types of tables, as it’s the primary key construct that implements
the subtype relationships . The unique key for all tables is the unique identifier
for each member of staff—the employee number. The set of primary keys of
employee is the union of the primary
keys of the various subtype tables, and the intersection of the primary
keys of all subtype tables is by construction empty, because each
employee belongs to just one, in this case, of the two categories. The
primary keys of subtype tables are also foreign keys, referencing the
primary key of
Please note that assigning totally independent primary keys to the subtype tables would, of course, be a disastrous mistake. In the real world however, you will certainly find examples in which this disastrous mistake has been perpetrated. Note also that entity sub-types are not the same as master-detail relationships. They can quickly be distinguished on examination of their respective primary keys. For those who would think that this type of discussion is a bit academic (associating with the word “academic” some vague, slightly pejorative connotation), I’ll just say that whenever different subtypes use a primary key that is not a subset of the primary key of the parent table, the result is almost invariably pathetic performance, from many points of view.
One of the main principles to follow in order to achieve efficient database access is a principle attributed to Philip II of Macedonia, father of Alexander the Great, and that principle is: Divide and Rule. It is quite likely that the vast majority of the queries executed by the HR department will belong to either of two categories: they will be either generic queries about all the people working in an organization or specific queries about one category of person. In both cases, by using subtypes correctly,[*] we will only need to examine that data which is most likely to provide the result that we require, and no time will be wasted examining irrelevant information. If we were to put everything into a single table, the most modest query would have to plow through a much greater quantity of data, most of which is useless in the context of that query.
It is always an unsound situation in which there are implicit constraints on your data—for instance “if the business line is such, then the identifier is numeric (although defined as a string of characters to accommodate other business lines),” or “if the model is T, then the color is necessarily black.” Sometimes, such general knowledge information can prove extremely efficient when filtering data. However, if it remains human knowledge, the DBMS engine, unaware of it, will be unable to take advantage of it, and the optimizer will not possess the necessary information to affect the most efficient database access. In the worst case, implicit constraints can even lead to a runtime failure. For instance, you might inadvertently require the database engine to apply an arithmetic process to a character string. This can happen when a character-defined column is used only for storing numeric data, and a non-numeric character slips in.
As an aside, the example of a string identifier that sometimes
contains character data and sometimes numerical data illustrates a
confusion over domain definitions in the initial database design. It is
quite clear that the nature of such a field varies according to
circumstances—which is totally unacceptable in a properly designed
database. If we need to store, for instance, configuration parameters of
various natures (numerical, Boolean, character, and so on), we should
not store them in a single table
parameter_value), but rather use a generic table
parameter_type) and have as many subtypes as we have parameter
types. If we use, for instance,
parameter_value is a numeric column, any
mistyping of the letter “O” instead of zero will be detected by the DBMS
when the configuration is changed, instead of resulting in a runtime
error when the parameter is used.
Define all the constraints you can. Primary keys are, of course, a sine qua non in a relational database. Use alternate key, when they characterize the data and any type of unique constraints. Foreign keys, which ensure that your data is consistent by mapping to master tables, are vital as part of the comprehensive expression of the meaning of the data model. Constraints that control the range of values that can be entered are also valuable. Constraints have two major impacts:
They contribute to ensuring the integrity of your data, guaranteeing that everything, as far as defined rules are concerned, is consistent with those rules.
They provide valuable information about your data to the DBMS kernel, and more specifically to the optimizer. Even if today the optimizer does not make full use of all available constraint data, it is likely that in future releases of the database system, that constraint data will become used for more sophisticated processing by the kernel.
The earlier example of the confusion over multiple shipping and billing addresses is a further example of the way semantic information is lost to the database by a fundamentally weak design. This essential information must therefore be placed into an unpredictable number of application programs. “If the billing address is null, then the headquarters address applies” is a rule that is unknown to the database and must therefore be handled in the programs—note the use of the plural programs here! Once again, everything that is defined in the database is defined only once, thus guaranteeing that no program will use the data inconsistently. Implicit rules about, for example, address precedence must be coded into every program accessing the data. Because these implicit rules are totally arbitrary, it is not impossible at all that in some cases the billing address will be the shipping address, and not the headquarters address.
As always, pushing a line of reasoning to the limits (and
often past them) can result in a monument to human madness. A great
favorite with third-party software editors is the
“more-flexible-than-thou” construct, in which most data of interest is
stored in some general purpose table, with equally general purpose
attributes such as:
attribute_value. In this “design,” everything
is stored as a character string into
attribute_value. The design certainly avoids
the risk of having null values. However, the proponents of this type of
design usually store the mandatory attributes in
attribute_value as well. Their mantra, by the
way, is usually that this design approach makes it easy to add new
attributes whenever they are needed. Without commenting on the quality
of a design that makes it necessary to anticipate the necessarily
haphazard addition of attributes, let’s just remark that it’s all very
nice to store data, but usually, somehow, one day you will have to
retrieve and process that same data (if data retrieval is not being
planned, there is something seriously wrong somewhere). Adding a column
to a table really pales into insignificance when compared to writing a
program to do something useful with the new bits of information that you
are given to manage (as enthusiasts that praise the flexibility of the
Extensible Markup Language [XML] are bound to understand).
The database cost of such pseudoflexibility rockets sky-high. Your database integrity is totally sacrificed, because you can hardly have a weaker way of typing your data. You cannot have any referential integrity. You cannot, in fact, have any type of declarative constraints. The simplest query becomes a monstrous join, in which the “value table” is joined 10, 15, or 20 times to the very same entity, depending on the number of attributes one wants to select. Needless to say, even the cleverest optimizer is at a loss on such a query, and performance is what one should expect—dismal. (You can try to improve the performance of such a query as described in Chapter 11, but the SQL code is not a pretty sight.) By comparison, the most inept campaign of military history looks like a masterpiece of strategic planning.
Working with historical data is an extremely common condition—the process of valuation , or specifying the price of goods or a service at a particular point in time, is based on historical data—but one of the really difficult issues of relational design is the handling of data that is associated with some period (as opposed to point) of time.
There are several ways to model historical data. Let’s assume that
we want to record the successive prices of goods identified by some
article_id. An obvious way to do so
is to store the following items:
(article_id, effective_from_date, price)
the date when the new price takes effect, and the primary key of the
historical table is
Logically correct, this type of model is rather clumsy to use when
working with current data, which in many cases will
be our main concern. How are we going to identify the current value?
It’s the one associated with the highest
effective_from_date, and it will be retrieved
by running a query looking like:
select a.article_name, h.price from articles a, price_history h where a.article_name =
some_nameand h.article_id = a.article_id and h.effective_from_date = (select max(b.effective_from_date) from price_history b where b.article_id = h.article_id)
Executing this query requires two passes over the same data: one in the inner query to identify which is the most recent date we have for a given article, and one in the outer query to return the price from a row that we have necessarily hit in the inner query (Chapter 6 talks about special functions implemented by some DBMS systems that can avoid, to some extent, multiple passes). Executing repeated queries following this pattern can prove very costly.
However, the choice of how to register the validity period for a price is arbitrary. Instead of storing the effective date from which the price applies, why not store the “end date” (e.g., the last date on which the current price prevails), identifying the time intervals by their upper bound instead of by their lower bound?
This new approach may look like an attractive solution. You have two ways to define current values—either that the end date is undefined, which looks neat but isn’t necessarily a good idea, or that the end date is something like December 31, 3000.
It’s quite obvious that looking for the price of an article as of December 31, 3000 will take you directly to the row you want, in a single pass. Definitely attractive. Is this the perfect solution? Not quite. There may be some practical worries with the optimizer, which I discuss in Chapter 6, but there is also a major logical issue: prices, as any consumer knows, rarely stay constant, and price increases are not usually decided instantly (financial environments may be something different). What happens when, for example, in October, new prices are decided for the next year and duly recorded in the database?
What we get in our valuation table are two records for each item:
one stating the current price, valid until December 31, and one giving
the price that will be applied from January 1. If we store the first
date when the price applies we will have one row with an
effective_from_date in the past (for instance
January 1 of the current year) and another one in the future (say, the
next January 1). In effect, what will define the current price is not
the highest date, but the highest date before today (returned in Oracle
by the system function
preceding query needs to be modified only slightly:
select a.article_name, h.price from articles a, price_history h where a.article_name =
some_nameand h.article_id = a.article_id and h.effective_from_date = (select max(b.effective_from_date) from price_history b where b.article_id = h.article_id and b.effective_from_date <= sysdate)
If we store the last day when the price applies, we will have one
row with an
end_date set to December
31 and another with
either to null or doomsday. Expressing that we want the price for which
end_date is the smallest date
after the current date is no obvious improvement on the query just
Denormalization is of course a possible solution—one can imagine
storing both the date when a price becomes effective and the date when
it ceases to be, or one could also argue for storing the
effective_from_date and the number of days for
applies. This could allow using either the start or the end of the
period, as best suits the query.
Denormalization always implies taking a risk with data integrity—a minor date entry error can leave black holes when no price is defined. You can of course minimize the risk by adding more checks when data is inserted or updated, but there is always a performance penalty associated with such checks.
Another possible solution is to have a current table and a historical table and plan a migration of rows from current to historical when prices change. This approach can suit some kinds of applications, but may be complicated to maintain. Moreover, the “pre-recording” of future prices fits rather badly into the picture.
In practice, particular storage techniques such as partitioning,
which I discuss in Chapter 5, will
come to the rescue, making constructs such as the one using the
effective_from_date less painful than they
might otherwise have been, especially for mass processing.
But before settling for one solution, we must acknowledge that valuation tables come in all shapes and sizes. For instance, those of telecom companies, which handle tremendous amounts of data, have a relatively short price list that doesn’t change very often. By contrast, an investment bank stores new prices for all the securities, derivatives, and any type of financial product it may be dealing with almost continuously. A good solution in one case will not necessarily be a good solution in another.
It is flattering (and a bit frightening too) to performance specialists to see the faith in their talents devotedly manifested by some developers. But, at the risk of repeating myself, I must once again stress what I said in the introduction to this book: tuning is about getting the best possible performance, now. When we develop, we must have a different mindset and not think “let’s code it, and then have a specialist tune it later in production.” The impact of tuning on the structure of programs is usually nil, and on queries, often minimal once the big mistakes have been corrected. There are indeed two aspects to this matter:
One aspect of tuning is the improvement of the overall condition of the system, by setting some parameters in accordance with the current resources in terms of CPU power, memory available, and I/O subsystems, and sometimes taking advantage of the physical implementation of the DBMS. This is a highly technical task, which may indeed improve the performance of some processes by a significant factor, but rarely by more than 20 or 30 percent unless big mistakes were made.
The other aspect of tuning is the modification of specific queries, a practice that may, unfortunately, expose the limitations of the query optimizer and changes of behavior between successive DBMS releases.
That is all there is to it.
In my view, adding indexes doesn’t really belong to the tuning of production databases (even if some tuning engagements are sometimes a matter of reviewing and correcting the indexing scheme for a database). Most indexes can and must be correctly defined from the outset as part of the designing process, and performance tests should resolve any ambiguous cases.
Performance is no more a question of making a couple of queries faster than war is a question of winning a couple of battles. You can win a battle and lose the war. You can tune your queries and nevertheless have an application with dismal performance that nobody will want to use, except at gunpoint. Your database and programs, as well as your SQL queries, must all be properly designed.
A functionally correct design is not enough. Performance must be incorporated into the design—and down-stream tuning provides for that little surplus of power that can provide peace of mind.
Besides all the questions addressed earlier in this chapter, the operating mode is also a matter that may have significant impact on the working system. What I mean by operating mode is whether data should be processed asynchronously (as is the case with batch programs ) or synchronously (as in a typical transactional program).
Batch programs are the historical ancestors of all data processing and are still very much in use today even if no longer very fashionable; synchronous processing is rarely as necessary as one might think. However, the improvement of networks and the increase in bandwidth has led to the “global reach” of an increasing number of applications. As a result, shutting down your online transaction processing (OLTP) application running in the American Midwest may become difficult because of East Asian users connected during one part of the Midwestern night and European users connected during the other part. Batch programs can no longer assume that they are running on empty machines. Moreover, ever-increasing volumes of data may require that incoming data is processed immediately rather than being allowed to accumulate into unmanageably large data sets. Processing streams of data may simply be the most efficient way to manage such quantities.
The way you process data is not without influence on the way you “think” of your system, especially in terms of physical structures—which I talk about more in Chapter 5. When you have massive batch programs, you are mostly interested in throughput—raw efficiency, using as much of the hardware resources as possible. In terms of data processing, a batch program is in the realm of brute force. When you are processing data on the fly, most activity will be small queries that are going to be repeatedly executed a tremendous number of times. For such queries, performing moderately well is not good enough—they have to perform at the maximum possible efficiency. With an asynchronous program, it is easy to notice that something is wrong (if not always easy to fix): it just takes too long to complete. With synchronous processing, the situation is much more subtle, because performance problems usually show up at the worst moment, when there are surges of activity. If you are not able to spot weaknesses early enough, your system is likely to let you down when your business reaches maximum demand levels—the very worst time to fail.
For all the talk about grids, clustered servers, and the like, spreading data across many servers means adding a considerable amount of complexity to a system. The more complicated a structure—any type of structure—the less robust it is. Technological advance does indeed slowly push up the threshold of acceptability. In the eighteenth century, clocks indicating the minutes were considered much less reliable than those indicating only the hour, and much more reliable than those showing the day in the month or the phases of the moon. But nevertheless, try to keep the theater of operations limited to that which is strictly required.
Transparent references to remote data are performance killers, for two reasons. First, however “transparent” it may look, crossing more software layers and a network has a heavy cost. To convince yourself, just run a procedure that inserts a few thousands rows into a local table, and another one doing the very same thing across—for instance, an Oracle database link, even on the same database—you can expect performance to be in the neighborhood of five times slower, if not worse, as you see demonstrated in Chapter 8.
Second, combining data from several sources is extremely difficult. When comparing data from source A to data from source B, you have no choice other than literally copying the data from A to B or the reverse. Transfer is one significant overhead. Data drawn from its own carefully constructed environment no longer benefits from the planning which went into establishing that environment (carefully thought-out physical layout, indexes, and so forth). Instead, that data lands in some temporary storage—in memory if the amount of data transferred is modest, otherwise on disk. The management of temporary storage is another major overhead. In a case where nested loops would be, in theory, the most efficient way to proceed when querying local data, an optimizer is left with two unattractive possibilities when some of the data is remotely located:
Using nested loops and incurring high overhead with each iteration
Sucking the remote data in, and then operating against the local copy, which has left all indexes behind
Optimizers can be forgiven for not performing at their best under these circumstances.
When it comes to the placement of major data repositories, some of the art is simply keeping a balance. If your company operates worldwide, keeping all the data at one location is unlikely to be a popular solution with people who live and work at the antipodes. Hitting a remote server is certainly no problem when surfing the Internet—it is quite another matter when using an application intensely. It’s not a question of bandwidth, it’s a question of light speed, for which, unfortunately, not much improvement can be expected from technological progress. Whatever you do, issuing a query against a server located on another continent adds another quarter or half second to response times, depending on the continent—and this at the best of times. If you need everyone to have the global picture, replication solutions and products (as opposed to remote access) should be contemplated. For each group of players, keep their own chessboard right at hand—don’t make players reach.
Other points to keep in mind when designing are what will happen if some piece of hardware breaks (for example, a disk controller) or if some mistake is made (for instance, the same batch program is applied twice). Even if your administrators are wizards who are doing night shifts to bring everything back on course by dawn, transfer rates are limited; the recovery of a huge database always takes a lot of time. “Spare” backup databases maintained in synch (or with some slight delay) may help. But backup databases will not be of any use in the case of a program inadvertently run twice, especially if the synchronization delay is shorter than the execution time of the program. What is already complicated with one database becomes a nightmare with several related databases, because you must be perfectly certain that all the databases are correctly synchronized after any recovery, to avoid any risk of data corruption.
This particular point of recovery is often a bone of contention between developers and database administrators, because developers tend to consider, not unreasonably, that backups and recoveries belong to administrators, while administrators point out, logically, that if they can guarantee that the container is in working order, they have no idea about the status of the contents. Indeed, any functional check in case of recovery should not be forgotten by developers. The more complicated the overall design, the more important it is for developers to keep in mind the constraints of operations.
We have reviewed the basic foundations for laying plans in constructing a database system. We have reviewed the fundamentals of data modeling , and in particular the broad steps involved in normalizing data to third normal form. We have then proceeded to review a number of scenarios, in which a faulty design can be identified as the road to disaster.
Most examples in this chapter come directly from or are inspired by cases I have encountered in some big companies. And it is always striking to consider how much energy and intelligence can be wasted trying to solve performance problems that are born from the ignorance of elementary design principles. Such performance issues need not be present, yet they are quite common and often made worse by further denormalization of what is already a questionable design, on the unassailable grounds of “performance improvement.” One query may, in fact, run much faster, but unfortunately, the nightly batch program now takes twice as long. In this way, and almost without being noticed, a full information system is built on a foundation of sand.
[*] The expression business requirement is meant to encompass non-commercial as well as commercial activities.
[*] You can have 3NF but not BCNF if your table contains several sets of columns that are unique (candidate keys, which are possible unique identifiers of a row) and share one column. Such situations are not very common.
[*] You can use subtypes incorrectly. As one of the reviewers remarked, having a kind of super-generic parent table that is referred to several times in the most innocuous query isn’t a model for efficiency. Such a super-generic parent table is hammered by all queries if it stores vital information. Subtypes must be born of logical distinction, not of an ill-conceived desire to implement with tables a strong inheritance scheme inspired from object-oriented techniques.