In Chapter 1, we tried to present a convincing case for why most databases should be modeled as relational databases, rather than single-table flat databases. We tried to make it clear why we split the single LIBRARY_FLAT table into four separate tables: AUTHORS, BOOKS, PUBLISHERS, and BOOK/AUTHOR.
However, for large real-life databases, it is not always clear how to split the data into multiple tables. As we mentioned in Chapter 1, the goal is to do this in such a way as to minimize redundancy, without losing any information.
The problem of effective database design is a complex one. Most people consider it an art rather than a science, which means that intuition plays a major role in good design. Nonetheless, there is a considerable body of database design theory, and it can be quite complicated. Our goal in this chapter is to touch upon the general ideas without becoming involved in the details. We hope this discussion will provide a helpful guide to the intuition needed for database design.
As we saw in Chapter 1, redundant data tends to inflate the size of a database, which can be a very serious problem for medium to large databases. Moreover, redundancy can lead to several types of anomalies, as discussed earlier. To understand the problems that can arise from redundancy, we need to take a closer look at what redundancy means.
Let us begin by observing that the attributes of a table scheme can be classified into three groups:
Attributes used strictly for identification purposes
Attributes used strictly for informational purposes
Attributes used for both identification and informational purposes
For example, consider the table scheme:
{PubID,PubName,PubPhone,YearFounded}
In this scheme, PubID is used strictly for identification purposes. It carries no informational content. On the other hand, YearFounded is strictly for informational purposes in this context. It gives the year that the publishing company was founded, but is not required for identification purposes.
Consider also the table scheme:
{Title,PubID,AuID,PageCount,CopyrightDate}
In this case, if we assume that there is only one book of a given title published by a given publisher and written by a given author, then {Title,PubID,AuID} is a key. Hence, each of these attributes is used (at least in part) for identification. However, Title is also an informational attribute.
We should hasten to add that these classifications are somewhat subjective, and depend upon the assumptions made about the entity class. Nevertheless, this classification does provide a useful intuitive framework.
We can at least pin down the strictly informational attributes a bit more precisely by making the following observation. The sign that an attribute is being used (at least in part) for identification purposes is that it is part of some key. Thus, an attribute that is not part of any key is being used, in that table scheme, strictly for informational purposes. Let us call such an attribute a strictly informational attribute.
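The observation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `strictly_informational` is our own invention, and we assume that the complete list of keys for a table scheme is supplied by hand (here, the single key {Title,PubID,AuID} assumed for the book scheme discussed earlier).

```python
def strictly_informational(attributes, keys):
    """Return the attributes that are not part of any key.

    attributes: list of attribute names in the table scheme
    keys: list of sets, one set of attribute names per key
          (assumed to be the complete list of keys for the scheme)
    """
    # Collect every attribute that appears in at least one key.
    in_some_key = set().union(*keys)
    # Whatever is left over is used strictly for informational purposes.
    return [a for a in attributes if a not in in_some_key]


# The book scheme from the text, with the key assumed there.
book_scheme = ["Title", "PubID", "AuID", "PageCount", "CopyrightDate"]
book_keys = [{"Title", "PubID", "AuID"}]

info = strictly_informational(book_scheme, book_keys)
print(info)  # ['PageCount', 'CopyrightDate']
```

Note that the result depends entirely on which keys we assume. Change the assumptions about the entity class and the classification changes with them.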
Now consider the table shown in Table 4-1. In this case, both Title and PubName are strictly informational, since {ISBN} is the only key, and neither Title nor PubName is part of that key. However, the values of Title are not redundant (the fact that two rows carry the same title does not mean that either one can be dropped), whereas the values of PubName are redundant.
Table 4-1. A Table with Two Informational Attributes
| ISBN | Title | PubID | PubName |
|---|---|---|---|
| 1-1111-1111-1 | C++ | 1 | Big House |
| 0-91-335678-7 | Faerie Queene | 1 | Big House |
| 1-011-22222-0 | C++ | 2 | ABC Press |
The reason that Title is not redundant is that there is no way to eliminate any of these titles. Each book entity must have its title listed somewhere in the database—one title per ISBN. Thus, the two titles C++ must both appear somewhere in the database.
On the other hand, PubName is redundant, as can easily be seen from the fact that the same PubName is listed twice without adding any new information to the database. To look at this another way, consider the table with two cells blank in Table 4-2. Can you fill in the title field for the last row? Not unless you call the publisher to get the title for that ISBN. In other words, some information is missing. On the other hand, you can fill in the blank PubName field, because the first row already shows that PubID 1 corresponds to Big House.
Table 4-2. A Table with Blank Cells to Illustrate Attribute Dependency
| ISBN | Title | PubID | PubName |
|---|---|---|---|
| 1-1111-1111-1 | Macbeth | 1 | Big House |
| 2-2222-2222-2 | Hamlet | 1 | |
| 5-555-55555-5 | | 2 | ABC Press |
The issue here is quite simple. The Title attribute depends only upon the ISBN attribute and {ISBN} is a key. In other words, Title depends only upon a key. However, PubName depends completely upon PubID, which is not a key for this table scheme. (Of course, PubName also depends on the key {ISBN}, but that is not relevant.)
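The blank-cell experiment with Table 4-2 can be mimicked in code. The following is a minimal sketch, not a general-purpose tool: the function name `fill_from_dependency` is hypothetical, blank cells are represented by `None`, and we simply assume that the dependent attribute is determined by the other attribute whenever a matching row exists.

```python
# Table 4-2, with None standing in for the blank cells.
rows = [
    {"ISBN": "1-1111-1111-1", "Title": "Macbeth", "PubID": 1, "PubName": "Big House"},
    {"ISBN": "2-2222-2222-2", "Title": "Hamlet", "PubID": 1, "PubName": None},
    {"ISBN": "5-555-55555-5", "Title": None, "PubID": 2, "PubName": "ABC Press"},
]


def fill_from_dependency(rows, determinant, dependent):
    """Fill blank `dependent` cells from rows that share the same `determinant` value."""
    # Remember the dependent value wherever it is already known.
    known = {}
    for row in rows:
        if row[dependent] is not None:
            known[row[determinant]] = row[dependent]
    # Copy each row, filling a blank only if some other row supplies the value.
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[dependent] is None:
            new_row[dependent] = known.get(new_row[determinant])
        filled.append(new_row)
    return filled


by_pubid = fill_from_dependency(rows, "PubID", "PubName")
by_isbn = fill_from_dependency(rows, "ISBN", "Title")

print(by_pubid[1]["PubName"])  # Big House
print(by_isbn[2]["Title"])     # None
```

The blank PubName can be recovered because another row repeats the same PubID. The blank Title cannot, because ISBN values never repeat, so no other row carries the missing information.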
Thus, we have seen a case where redundancy results from the fact that one attribute depends upon another attribute that is not a key. Armed with this observation, we can move ahead.
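As a rough illustration of this observation, the sketch below tests the sample rows of Table 4-1. The helper names (`depends_on`, `is_unique`, `redundant_in_sample`) are our own, and the check examines only the rows at hand: a dependency that holds in three sample rows is evidence for, not proof of, a dependency in the table scheme itself.

```python
COLUMNS = ("ISBN", "Title", "PubID", "PubName")

# The rows of Table 4-1.
table_4_1 = [
    ("1-1111-1111-1", "C++", 1, "Big House"),
    ("0-91-335678-7", "Faerie Queene", 1, "Big House"),
    ("1-011-22222-0", "C++", 2, "ABC Press"),
]


def column(rows, name):
    """Return the list of values in the named column."""
    i = COLUMNS.index(name)
    return [row[i] for row in rows]


def depends_on(rows, dependent, determinant):
    """True if equal determinant values always carry equal dependent values."""
    seen = {}
    for d, v in zip(column(rows, determinant), column(rows, dependent)):
        # setdefault stores v the first time d is seen, then returns the stored value.
        if seen.setdefault(d, v) != v:
            return False
    return True


def is_unique(rows, name):
    """True if no value repeats in the named column (as with a one-attribute key)."""
    values = column(rows, name)
    return len(set(values)) == len(values)


def redundant_in_sample(rows, dependent, determinant):
    """Dependent values repeat needlessly when the determinant itself repeats."""
    return depends_on(rows, dependent, determinant) and not is_unique(rows, determinant)


print(redundant_in_sample(table_4_1, "PubName", "PubID"))  # True
print(redundant_in_sample(table_4_1, "Title", "ISBN"))     # False
```

PubName is flagged because it depends on PubID, whose values repeat. Title is not flagged, because the attribute it depends on, ISBN, is unique in every row.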