Chapter 4. Database Design Principles
In Chapter 1 I tried to present a convincing case for why most databases should be modeled as relational databases, rather than single-table flat databases. I tried to make it clear why I split the single LIBRARY_FLAT table into four separate tables: AUTHORS, BOOKS, PUBLISHERS, and BOOK/AUTHOR.
However, for large real-life databases, it is not always clear how to split the data into multiple tables. As I mentioned in Chapter 1, the goal is to minimize redundancy, without losing any information.
The problem of effective database design is a complex one. Most people consider it an art rather than a science. This means that intuition plays a major role in good design. Nonetheless, there is a considerable theory of database design, and it can be quite complicated. My goal in this chapter is to touch upon the general ideas, without becoming involved in the details. Hopefully, this discussion will provide a helpful guide to the intuition needed for database design.
Redundancy
As we saw in Chapter 1, redundant data tends to inflate the size of a database, which can be a very serious problem for medium to large databases. Moreover, redundancy can lead to several types of anomalies, as discussed earlier. To understand the problems that can arise from redundancy, we need to take a closer look at what redundancy means.
Let us begin by observing that the attributes of a table scheme can be classified into three groups:
Attributes used strictly for identification purposes
Attributes used strictly for informational purposes
Attributes used for both identification and informational purposes
For example, consider the table scheme:
{PubID,PubName,PubPhone,YearFounded}
In this scheme, PubID is used strictly for identification purposes. It carries no informational content. On the other hand, YearFounded is strictly for informational purposes in this context. It gives the year that the publishing company was founded, but is not required for identification purposes.
Consider also the table scheme:
{Title,PubID,AuID,PageCount,CopyrightDate}
In this case, if we assume that there is only one book of a given title published by a given publisher and written by a given author, then {Title,PubID,AuID} is a key. Hence, each of these attributes is used (at least in part) for identification. However, Title is also an informational attribute.
I should hasten to add that these classifications are somewhat subjective and depend upon the assumptions made about the entity class. Nevertheless, this classification does provide a useful intuitive framework.
We can at least pin down the strictly informational attributes a bit more precisely by making the following observation. The sign that an attribute is being used (at least in part) for identification purposes is that it is part of some key. Thus, an attribute that is not part of any key is being used, in that table scheme, strictly for informational purposes. Let us call such an attribute a strictly informational attribute.
Now consider Table 4-1. In this case, both Title and PubName are strictly informational, since {ISBN} is the only key, and neither Title nor PubName is part of that key. However, the values of Title are not redundant (the fact that they are the same does not mean that they are not both required), whereas the values of PubName are redundant.
ISBN | Title | PubID | PubName |
1-1111-1111-1 | C++ | 1 | Big House |
0-91-335678-7 | Faerie Queene | 1 | Big House |
1-011-22222-0 | C++ | 2 | ABC Press |
The reason that Title is not redundant is that there is no way to eliminate any of these titles. Each book entity must have its title listed somewhere in the database—one title per ISBN. Thus, the two titles C++ must both appear somewhere in the database.
On the other hand, PubName is redundant, as can easily be seen from the fact that the same PubName is listed twice without adding any new information to the database. To look at this another way, consider the table with two cells blank in Table 4-2. Can you fill in the title field for the last row? Not unless you call the publisher to get the title for that ISBN. In other words, some information is missing. On the other hand, you can fill in the blank PubName field.
ISBN | Title | PubID | PubName |
1-1111-1111-1 | Macbeth | 1 | Big House |
2-2222-2222-2 | Hamlet | 1 | |
5-555-55555-5 | 2 | ABC Press |
The issue here is quite simple. The Title attribute depends only upon the ISBN attribute, and {ISBN} is a key. In other words, Title depends only upon a key. However, PubName depends completely upon PubID, which is not a key for this table scheme. (Of course, PubName also depends on the key {ISBN}, but that is not relevant.)
Thus, we have seen a case where redundancy results from the fact that one attribute depends upon another attribute that is not a key. Armed with this observation, we can move ahead.
Normal Forms
Those who make a study of database design have identified a number of special forms, properties, or constraints that a table scheme may possess, in order to achieve certain desired goals, such as minimizing redundancy. These forms are called normal forms. There are six commonly recognized normal forms, with the inspired names:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Boyce Codd normal form (BCNF)
Fourth-normal form (4NF)
Fifth normal form (5NF)
We will consider the first four of these normal forms, but only informally. Each of these normal forms is stronger than its predecessors. Thus, for instance, a table scheme that is in third normal form is also in second normal form. While it is generally desirable for the table schemes in a database to have a high degree of normalization, as we will see in this chapter, the situation is not as simple as it may seem.
For instance, requiring that all table schemes be in BCNF may cause some loss of information about the various relationships between the table schemes. In general, it is possible to manipulate the data to achieve third normal form for all table schemes, but this may turn out to be far more work than it is worth.
The plain fact is that forcing all table schemes to be in a particular normal form may require some compromises. Each individual situation (database) must be examined on its own merit. It is impossible to make general rules that apply in all situations.
The process of changing a database design to produce table schemes in normal form is called normalization .
First Normal Form
First normal form is very simple. A table scheme is said to be in first normal form if the attribute values are indivisible. To illustrate, we considered in Chapter 1 the question of including all the authors of a book in a single attribute, called Authors. Here is an example entity:
ISBN = 0-55-123456-9 Title = Main Street Authors = Jones, H. and Smith, K. Publisher = Small House
Since the table scheme in this case allows more than one author name for the Authors attribute, the scheme is not in first normal form. Indeed, one of the obvious problems with the Authors attribute is that it is impossible to sort the data by individual author name. It is also more difficult to, for instance, prepare a mailing label for each author, and so on.
Attributes that allow only indivisible values are said to be scalar attributes or atomic attributes. By contrast, an attribute whose values can be, for example, a list of items (such as a list of authors) is said to be a structured attribute . Thus, a table scheme is in first normal form if all of its attributes are atomic. Good database design almost always requires that all attributes be atomic, so that the table scheme is in first normal form.
In general, making the adjustments necessary to ensure first normal form is not hard, and it is a good general rule that table schemes should be put in first normal form. However, as with the other normal forms (and even more so the higher up we go) each situation must be considered on its own merits. For instance, a single field might be designed to hold a street address, such as “1333 Bessemer Street.” Whether the house number and the street name should be separated into distinct attributes is a matter of context. Put another way, whether a street address is atomic depends upon the context. If there is reason to manipulate the street numbers apart from the street names, then they should certainly constitute their own attribute. Otherwise, perhaps not.
Functional Dependencies
Before we can discuss the other normal forms, we need to discuss the concept of functional dependency , which is used to define these normal forms. This concept is quite simple, and we have actually been using it for some time now. As an example, we have remarked that, for the Publishers table scheme, the PubName attribute depends completely on the PubID attribute. (More properly, we should say that the value of the PubName attribute depends completely on the value of the PubID attribute, but the earlier shorthand is convenient.) Thus, we can say that the functional dependency from PubID to PubName, written:
PubID → PubName
holds for the Publishers table scheme. This can be read “PubID determines PubName” or “PubName depends on PubID.”
More generally, suppose that {A1,...,Ak} are attributes of a table scheme and that {B1,...,Bn} are also attributes of the same table scheme. We do not require that the Bs be different from the As. Then the attributes B1,...,Bn depend on the attributes A1,...,Ak, written:
{A1,...,Ak}
→
{B1,...,Bn}
if the values of A1,...,Ak completely determine the values of B1,...,Bn. Our main interest is when there is only one attribute on the right:
{A1,...,Ak}
→ {B}
For instance, it is probably safe to say that:
{PubName,PubPhone} → {PubID}
which is just another way of saying that there is only one publisher with a given name and phone number (including area code).
It is very important to understand that a functional dependency means that the attributes on the left completely determine the attributes on the right for now and for all time to come, no matter what additional data may be added to the database. Thus, just as the concept of a key relates to entity classes (table schemes) rather than individual entity sets (tables), so does functional dependency. Every table scheme has its set of associated functional dependencies, which are based on the meaning of the attributes.
Recall that a superkey is a set of attributes that uniquely determines an entity. Put another way, a superkey is a set of attributes upon which all other attributes of the table scheme are functionally dependent.
Some functional dependencies are obvious. For instance, an attribute functionally depends upon itself. Also, any set of attributes functionally determines any subset of these attributes, as in:
{A,B,C} → {A,B}
This just says that if we know the values of A, B, and C, then we know the value of A and B! Such functional dependencies are not at all interesting, and are called trivial dependencies . All other dependencies are called nontrivial.
Second Normal Form
Intuitively, a table scheme T is in second normal form if all of the strictly informational attributes (attributes that do not belong to any key) are attributes of the entities in the table scheme, and not of some other class of entities. In other words, the informational attributes provide information specifically about the entities in this entity class and not about some other entities.
Let us illustrate with an example. Consider a simplified table scheme designed to store house addresses. One possibility is:
{City,Street,HouseNumber,HouseColor,CityPopulation}
The CityPopulation attribute is out of place here because it is an attribute of cities, not house addresses. More specifically, CityPopulation is strictly an informational attribute (not for identification of houses), but it gives information about cities, not house addresses. Thus, this table scheme is not in second normal form.
We can be a little bit more formal about the meaning of second normal form as follows. Referring to the previous example, we have the dependency:
{City} → {CityPopulation}
where CityPopulation does not belong to any key, and where City is a proper subset of a key, namely, the key {City, Street, HouseNumber}. (By proper subset, we mean a subset that is not the whole set.)
A table scheme is in 2NF if it is not possible to have a dependency of the form:
{A1,...,Ak}
→ {B}
where B does not belong to any key (is strictly informational) and {A1,...,Ak} is a propersubset of some key, and thus does not identify the entities of this entity class, but rather identifies the entities of some other entity class.
Let us consider another example of a table scheme that is not in second normal form.
Consider the following table scheme, and assume for the purposes of illustration that, while there may be many books with the same title, no two of them have the same publisher and author:
{Title,PubID,AuID,Price,AuAddress}
Thus, {Title, PubID, AuID} is the only key. Now, AuAddress does not belong to any key, but it depends upon {AuID}, which is a proper subset of the key, in symbols:
{AuID} → {AuAddress}
Hence, this table scheme is not in second normal form. In fact, AuAddress is not a piece of information about the entities modeled in the table scheme (i.e., books), but rather about authors. Of course, we could remove the AuAddress attribute to bring the table scheme into second normal form. (If each publisher charged a single price for all of its books, then Price would also cause a violation of second normal form, but this is not the case, of course.)
Third Normal Form
Second normal form is good, but we can do better. We have seen that if a table scheme is in second normal form, then no strictly informational attribute depends on a proper subset of a key. However, there is another undesirable possibility. Let us illustrate with an example.
Consider the following table scheme and assume, for the purposes of illustration, that no two books with the same title have the same publisher:
{Title,PubID,PageCount,Price}
The only key for this table scheme is {Title,PubID}. Both PageCount and Price are informational attributes only.
Now, let us assume that each publisher decides the price of its books based solely on the page count. First, we observe that this table is in second normal form. To see this, consider the proper subsets of the key. These are:
{Title} and {PubID}
But none of the dependencies:
{Title} → {PageCount} |
{Title} → {Price} |
{PubID} → {PageCount} |
{PubID} → {Price} |
hold for this table scheme. After all, knowing the title does not determine the book, since there may be many books of the same title, published by different publishers. Hence, the table is in second normal form.
It is also not correct to say that:
{PageCount} → {Price}
holds, because different publishers may use different price schemes based on page count. In other words, one publisher may price books over 1,000 pages at one price, whereas another may price books over 1,000 pages at a different price. However, it is true that:
{PubID,PageCount} → {Price}
holds. In other words, here we have an informational attribute (Price) that depends not on a proper subset of a key, but on a proper subset of a key (PubID) together with another informational attribute (PageCount).
This is bad, since it may produce redundancy. For instance, consider Table 4-3. Note that the price attribute is redundant. After all, we could fill in the Price value for the third row if it were blank, because we know that PubID 2 charges $34.95 for 500-page books.
Title | PubID | PageCount | Price |
Moby-Dick | 1 | 500 | 29.95 |
Giant | 2 | 500 | 34.95 |
Moby-Dick | 2 | 500 | 34.95 |
We can summarize the problem with the dependency:
{PubID,PageCount} → {Price}
by saying that the attribute Price depends upon a set of attributes:
{PubID,PageCount}
that is not a key, not a superkey, and not a proper subset of a key. It is a mix containing one attribute from the key {Title,PubID} and one attribute that is not in any key.
With this example in mind, we can now define third normal form. A table scheme is in third normal form if it is not possible to have a dependency of the form:
{A1,...,Ak}
→ {B}
where B does not belong to any key (is strictly informational) and {A1,...,Ak} is not a superkey. In other words, third normal form does not permit any strictly informational attribute to depend upon anything other than a superkey. Of course, superkeys determine all attributes, including strictly informational attributes, and so all attributes depend on any superkey. The point is that, with third normal form, strictly informational attributes depend only on superkeys.
Boyce-Codd Normal Form
It is possible to find table schemes that are in third normal form, but still have redundancy. Here is an example.
Consider the table scheme {City,StreetName,ZipCode}, with dependencies:
{City,StreetName} → {ZipCode}
and:
{ZipCode} → {City}
(Although in real life, a zip code may be shared by two different cities, we will assume otherwise for the purposes of illustration.) This table scheme is in third normal form. To see this, observe that the keys are {City,StreetName} and {ZipCode,StreetName}. Hence, no attribute is strictly informational, and there is nothing to violate third normal form.
On the other hand, consider Table 4-4. We can fill in the blank city name because {ZipCode}→{City}.
City | StreetName | ZipCode |
Los Angeles | Hollywood Blvd | 95000 |
Vine St | 95000 |
The problem here is with the dependency:
{ZipCode}→{City}
which does not violate third normal form because, as we have mentioned, {City} is not strictly informational.
The previous example gives us the idea to strengthen the condition in the definition of third normal form by dropping the requirement that B be strictly informational. Thus, we can define our last, and strongest, normal form. A table scheme is in Boyce-Codd normal form if it is not possible to have a dependency of the form:
{A1,...,Ak}
→ {B}
where {A1,...,Ak} is not a superkey. In other words, BCNF form does not permit any attribute to depend upon anything other than a superkey.
As mentioned earlier, all attributes must depend on any superkey by the very definition of superkey. Thus, BCNF is the strongest possible restriction of this type—it says that an attribute is not allowed to depend on anything other than a superkey.
Normalization
As mentioned earlier, the process of changing a database design to produce table schemes in normal form is called normalization.
As a very simple example, the table scheme:
{ISBN,Title,Authors}
is not even in first normal form, because the Authors attribute might contain more than one author and is therefore not atomic. By trading in this table scheme for the two schemes:
{ISBN,Title,AuID}
and:
{AuID,AuName}
we have normalized the database into first normal form.
Here is another example involving the higher normal forms. Recall from an earlier example that the table scheme {City,StreetName,ZipCode}, with dependencies:
{City,StreetName} → {ZipCode}
and:
{ZipCode} → {City}
is in third normal form. However, Table 4-5 shows that there is still some redundancy in the table scheme. The table scheme is not in BCNF. In fact, this was the example we used to motivate our definition of BCNF. (The example violates BCNF.)
City | StreetName | ZipCode |
Los Angeles | Hollywood Blvd | 95000 |
Vine St | 95000 |
However, we can split this table scheme into two schemes:
{ZipCode,City}
and:
{ZipCode,StreetName}
In this case, Table 4-5 gets split into two tables, Tables Table 4-6 and Table 4-7, and the redundancy is gone!
ZipCode | StreetName |
95000 | Hollywood Blvd |
95000 | Vine St |
Generally speaking, the design of a database may begin with an E/R diagram. This diagram can be implemented according to the principles discussed in Chapter 3. The result may very well be a perfectly satisfactory database design. However, if some of the table schemes have redundancies, it may be desirable to split them into smaller table schemes that satisfy a higher normal form, as in the previous example.
Decomposition
Although the decomposition of a table scheme into smaller (hopefully normalized) table schemes is desirable from an efficiency point of view (in order to reduce redundancy and avoid various anomalies), it does carry with it some risk, which primarily comes in two forms:
The possible loss of information
The possible loss of dependencies
The following example illustrates the first problem—loss of information. Consider the table scheme:
{AuID,AuName,PubID}
The only dependency in this table scheme is:
{AuID} → {AuName}
We could decompose this table scheme into the two schemes:
{AuID,AuName}
and:
{AuName,PubID}
Now consider Table 4-8, which has two different authors with the same name. The decomposition gives the two tables shown in Tables Table 4-9 and Table 4-10.
Unfortunately, if we were to ask Microsoft Access to show us the data for all authors named John Smith, we would get the table shown in Table 4-11, which is not the table we started with! Information has been lost, in the sense that we no longer know that both John Smiths together have published only two books, each author with a different publisher. (It may look as though we have more information, since the table is bigger, but in reality we have lost information.)
AuID | AuName | PubID |
A1 | John Smith | P1 |
A1 | John Smith | P2 |
A2 | John Smith | P1 |
A2 | John Smith | P2 |
The second problem I mentioned in connection with the decomposition of a table scheme is loss of dependencies. The issue is this: during the life of the database, we will be making changes (updates, insertions, and deletions) to the separate tables in the decomposition. Of course, we must be careful to preserve the functional dependencies that are inherited from the original table scheme. However, this does not necessarily guarantee that all of the original dependencies will be preserved!
Here is a simple example to illustrate the problem. Consider the table scheme:
{ISBN,PageCount,Price}
with dependencies:
{ISBN} → {PageCount} |
{PageCount} → {Price} |
Consider the decomposition into the table schemes:
{ISBN,PageCount}
and:
{ISBN,Price}
Note that the key {ISBN} is in both schemes in the decomposition.
Unfortunately, the decomposition has caused us to lose the dependency {PageCount}→{Price}, in the sense that these two attributes are not in the same table scheme of the decomposition. To illustrate, consider Table 4-12, which has two different books with the same page count and price. The decomposition of this table into two tables is shown in Tables Table 4-13 and Table 4-14.
ISBN | PageCount | Price |
0-111-11111-1 | 500 | $39.95 |
0-111-22222-2 | 500 | $39.95 |
Now here is the problem. Looking at the second table, we have no indication that the original scheme required that PageCount determine Price. Hence, we might change the price of the second book to $12.50, as we’ve done in Table 4-15.
But putting the tables back together for a look at all of the data gives us Table 4-16, which reveals a violation of the requirement that PageCount determine Price. In fact, somebody at the publishing company is going to be very unhappy that the company is now selling a 500-page book below cost!
ISBN | PageCount | Price |
0-111-11111-1 | 500 | $39.95 |
0-111-22222-2 | 500 | $12.50 |
By contrast, consider the decomposition of the original table scheme into:
{ISBN,PubPhone}
and:
{PubPhone,PubName}
Here, no dependency is lost, so we can update each separate table without fear.
The previous two examples illustrate the pitfalls in decomposing a table scheme into smaller schemes. If a decomposition does not cause any information to be lost, it is called a lossless decomposition. A decomposition that does not cause any dependencies to be lost is called a dependency-preserving decomposition.
Now it is possible to show that any table scheme can be decomposed, in a lossless way, into a collection of smaller schemes that are in the very nice BCNF form. However, we cannot guarantee that the decomposition will preserve dependencies. On the other hand, any table scheme can be decomposed—in a lossless way that also preserves dependencies—into a collection of smaller schemes that are in the almost-as-nice third normal form.
However, before you get too excited, I must hasten to add that the algorithms given do not always produce desirable results. They can, in fact, create decompositions that are less intuitive than we might do just using our intuition. Nevertheless, they can be relied upon to produce the required decomposition, if we can’t do it ourselves.
I should conclude by saying that there is no law that says that a database is always more useful or efficient if the tables have a high degree of normalization. These issues are more subjective than objective and must be dealt with, as a design issue, on an ad hoc basis. In fact, it appears that the best procedure for good database design is to mix eight parts intuition and experience with two parts theory. Hopefully, discussion of normalization has given you a general feel of the issues involved and will provide a good jumping-off place if you decide to study these somewhat complicated issues in greater depth. (See Appendix E for some books for further study.)
Get Access Database Design & Programming, 3rd Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.