Normalization

As we mentioned earlier, the process of changing a database design to produce table schemes in normal form is called normalization.

As a very simple example, the table scheme:

{ISBN,Title,Authors}

is not even in first normal form, because the Authors attribute might contain more than one author and is therefore not atomic. By trading in this table scheme for the two schemes:

{ISBN,Title,AuID} and {AuID,AuName}

we have normalized the database into first normal form.

Here is another example involving the higher normal forms.

Recall from an earlier example that the table scheme {City,StreetName,ZipCode}, with dependencies:

{City,StreetName}  {ZipCode}

and:

{ZipCode}  {City}

is in third normal form. However, Table 4.5 shows that there is still some redundancy in the table scheme. The table scheme is not in BCNF. In fact, this was the example we used to motivate our definition of BCNF. (The example violates BCNF.)

Table 4-5. A Table with Redundant Data

City

StreetName

ZipCode

Los Angeles

Hollywood Blvd

95000

 

Vine St

95000

However, we can split this table scheme into two schemes:

{ZipCode,City}

and:

{ZipCode,StreetName}

In this case, Table 4.5 gets split into two tables, Table 4.6 and Table 4.7, and the redundancy is gone!

Table 4-6. First Table Derived from Table 4.5 to Eliminate Redundancy

ZipCode

City

95000

Los Angeles

Table 4-7. Second Table Derived from Table 4.5 to Eliminate Redundancy

ZipCode

StreetName

95000

Hollywood Blvd

95000

Vine St

Generally speaking, the design of a database may begin with an E/R diagram. This diagram can be implemented according to the principles that we discussed in Chapter 3. The result may very well be a perfectly satisfactory database design. However, if some of the table schemes have redundancies, it may be desirable to split them into smaller table schemes that satisfy a higher normal form, as in the previous example.

Decomposition

Although the decomposition of a table scheme into smaller (hopefully normalized) table schemes is desirable from an efficiency point of view, in order to reduce redundancy and avoid various anomalies, it does carry with it some risk, which primarily comes in two forms:

  • The possible loss of information

  • The possible loss of dependencies

The following example illustrates the first problem—loss of information.

Consider the table scheme:

{AuID,AuName,PubID}

The only dependency in this table scheme is:

{AuID}  {AuName}

We could decompose this table scheme into the two schemes:

{AuID,AuName} and {AuName,PubID}

Now consider Table 4.8, which has two different authors with the same name. The decomposition gives the two tables shown in Table 4.9 and Table 4.10.

Table 4-8. A Table with Two Identical Author Names

AuID

AuName

PubID

A1

John Smith

P1

A2

John Smith

P2

Table 4-9. Partial Decomposition of Table 4.8

AuID

AuName

A1

John Smith

A2

John Smith

Table 4-10. Partial Decomposition of Table 4.8

AuName

PubID

John Smith

P1

John Smith

P2

Unfortunately, if we were to ask Microsoft Access to show us the data for all authors named John Smith, we would get the table shown in Table 4.11, which is not the table we started with! Information has been lost, in the sense that we no longer know that both John Smiths together have published only two books, each author with a different publisher. (It may look as though we have more information, since the table is bigger, but in reality we have lost information.)

Table 4-11. An Incorrect Reconstruction of Table 4.8

AuID

AuName

PubID

A1

John Smith

P1

A1

John Smith

P2

A2

John Smith

P1

A2

John Smith

P2

The second problem we mentioned in connection with the decomposition of a table scheme is that of loss of dependencies. The issue is this: During the life of the database, we will be making changes (updates, insertions, and deletions) to the separate tables in the decomposition. Of course, we must be careful to preserve the functional dependencies that are inherited from the original table scheme. However, this does not necessarily guarantee that all of the original dependencies will be preserved!

Here is a simple example to illustrate the problem. Consider the table scheme:

{ISBN,PageCount,Price}

with dependencies:

{ISBN}  {PageCount}
{PageCount}  {Price}

Consider the decomposition into the table schemes:

{ISBN,PageCount} and {ISBN,Price}

Note that the key {ISBN} is in both schemes in the decomposition.

Unfortunately, the decomposition has caused us to lose the dependency {PageCount}{Price}, in the sense that these two attributes are not in the same table scheme of the decomposition. To illustrate, consider Table 4.12, which has two different books with the same page count and price. The decomposition of this table into two tables is shown in Table 4.13 and Table 4.14.

Table 4-12. Table Example to Show Further Decomposition

ISBN

PageCount

Price

0-111-11111-1

500

$39.95

0-111-22222-2

500

$39.95

Table 4-13. Partial Decomposition of Table 4.12

ISBN

PageCount

0-111-11111-1

500

0-111-22222-2

500

Table 4-14. Partial Decomposition of Table 4.12

ISBN

Price

0-111-11111-1

$39.95

0-111-22222-2

$39.95

Now here is the problem. Looking at the second table, we have no indication that the original scheme required that PageCount determines Price. Hence, we might change the price of the second book to $12.50, as we’ve done in Table 4.15.

Table 4-15. Decomposition Example Changing Price

ISBN

Price

0-111-11111-1

$39.95

0-111-22222-2

$12.50

But putting the tables back together for a look at all of the data gives us Table 4.16, which reveals a violation of the requirement that PageCount determines Price. In fact, somebody at the publishing company is going to be very unhappy that the company is now selling a 500-page book at below cost!

Table 4-16. Looking at Data by Combining Tables Table 4.12 Through Table 4.15

ISBN

PageCount

Price

0-111-11111-1

500

$39.95

0-111-22222-2

500

$12.50

By contrast, consider the decomposition of the original table scheme into:

{ISBN,PubPhone} and {PubPhone,PubName}

Here, no dependency is lost, so we can update each separate table without fear.

The previous two examples illustrate the pitfalls in decomposing a table scheme into smaller schemes. If a decomposition does not cause any information to be lost, it is called a lossless decomposition. A decomposition that does not cause any dependencies to be lost is called a dependency-preserving decomposition.

Now it is possible to show that any table scheme can be decomposed, in a lossless way, into a collection of smaller schemes that are in the very nice BCNF form. However, we cannot guarantee that the decomposition will preserve dependencies. On the other hand, any table scheme can be decomposed, in a lossless way that also preserves dependencies, into a collection of smaller schemes that are in the almost-as-nice third normal form.

However, before getting too excited, we must hasten to add that the algorithms that we give do not always produce desirable results. They can, in fact, create decompositions that are less intuitive than we might do just using our intuition. Nevertheless, they can be relied upon to produce the required decomposition, if we can’t do it ourselves.

We should conclude by saying that there is no law that says that a database is always more useful or efficient if the tables have a high degree of normalization. These issues are more subjective than objective and must be dealt with, as a design issue, on an ad hoc basis. In fact, it appears that the best procedure for good database design is to mix eight parts intuition and experience with two parts theory. Hopefully, our discussion of normalization has given you a general feeling for the issues involved, and will provide a good jumping-off place if you decide to study these somewhat complicated issues in greater depth. (See Appendix E, for some books for further study.)

Get Access Database Design and Programming, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.