WHAT’S WRONG WITH DUPLICATES?

There are numerous practical arguments in support of the position that duplicate rows (“duplicates” for short) should be prohibited. Here I want to emphasize just one—but I think it’s a powerful one.[50] However, it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

  1. I assume you know that relational DBMSs include a component called the optimizer,[51] whose job is to try to figure out the best way to implement user queries and the like (where “best” basically means best performing).

  2. I assume you also know that one of the things optimizers do is what’s sometimes called query rewrite. Query rewrite is the process of transforming some relational expression exp1 (representing some user query, say) into another such expression exp2, such that exp1 and exp2 are guaranteed to produce the same result when evaluated but exp2 performs better than exp1 (at least, we hope so). Note: Be aware, however, that the term query rewrite is also used in certain commercial products with a different (typically more limited) meaning.

Now I can present my argument. The fundamental point I want to make is that certain expression transformations, and hence certain optimizations, that would be valid if SQL were truly relational aren’t valid in the presence of duplicates. By way of example, consider the (nonrelational) database shown in Figure 4-1. Note right away that the tables in that database have no keys (which is why there’s no double underlining in the figure). And by the way: If you’re thinking the database is unrealistic—and especially if you’re thinking you’re not going to be convinced by the arguments that follow, therefore—please see the further remarks on this example at the beginning of the next section.

A nonrelational database, with duplicates

Figure 4-1. A nonrelational database, with duplicates

Before going any further, perhaps I should ask the question: What does it mean to have three (P1,Screw) rows in table P and not two, or four, or seventeen? It must mean something, for if it means nothing, then why are the duplicates there in the first place? As I once heard Ted Codd say: If something is true, saying it twice doesn’t make it any more true.[52]

So I have to assume there’s some meaning attached to the duplication, even though that meaning, whatever it is, is hardly very explicit.[53] Given that duplicates do have some meaning, therefore, there are presumably going to be business decisions made on the basis of the fact that, for example, there are three (P1,Screw) rows in table P and not two or four or seventeen. For if not, then (to repeat) why are the duplicates there in the first place?

Now consider the following query on the database of Figure 4-1: “Get part numbers for parts that either are screws or are supplied by supplier S1, or both.” Here are some candidate SQL formulations for this query, together with the result produced in each case:

  1. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    OR     P.PNO IN
         ( SELECT SP.PNO
           FROM   SP
           WHERE  SP.SNO = 'S1' )

    Result: P1 * 3, P2 * 1.

  2. SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'
    OR     SP.PNO IN
         ( SELECT P.PNO
           FROM   P
           WHERE  P.PNAME = 'Screw' )

    Result: P1 * 2, P2 * 1.

  3. SELECT P.PNO
    FROM   P , SP
    WHERE  ( SP.SNO = 'S1' AND
             SP.PNO = P.PNO )
    OR     P.PNAME = 'Screw'

    Result: P1 * 9, P2 * 3.

  4. SELECT SP.PNO
    FROM   P , SP
    WHERE  ( SP.SNO = 'S1' AND
             SP.PNO = P.PNO )
    OR     P.PNAME = 'Screw'

    Result: P1 * 8, P2 * 4.

  5. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION  ALL
    SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 5, P2 * 2.

  6. SELECT DISTINCT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION  ALL
    SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 3, P2 * 2.

  7. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION  ALL
    SELECT DISTINCT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 4, P2 * 2.

  8. SELECT DISTINCT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    OR     P.PNO IN
         ( SELECT SP.PNO
           FROM   SP
           WHERE  SP.SNO = 'S1' )

    Result: P1 * 1, P2 * 1.

  9. SELECT DISTINCT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'
    OR     SP.PNO IN
         ( SELECT P.PNO
           FROM   P
           WHERE  P.PNAME = 'Screw' )

    Result: P1 * 1, P2 * 1.

  10. SELECT P.PNO
    FROM   P
    GROUP  BY P.PNO , P.PNAME
    HAVING P.PNAME = 'Screw'
    OR     P.PNO IN
         ( SELECT SP.PNO
           FROM   SP
           WHERE  SP.SNO = 'S1' )

    Result: P1 * 1, P2 * 1.

  11. SELECT P.PNO
    FROM   P , SP
    GROUP  BY P.PNO , P.PNAME , SP.SNO , SP.PNO
    HAVING ( SP.SNO = 'S1' AND
             SP.PNO = P.PNO )
    OR     P.PNAME = 'Screw'

    Result: P1 * 2, P2 * 2.

  12. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION
    SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 1, P2 * 1.

    Aside: Actually, certain of the foregoing formulations—which?—are a little suspect, because they effectively assume that every screw is supplied by at least one supplier. But this fact makes no material difference to the argument that follows. End of aside.

The first point to notice, then, is that the twelve different formulations produce nine different results: different, that is, with respect to their degree of duplication. (By the way, I make no claim that the twelve different formulations and the nine different results are the only ones possible; indeed, they aren’t, in general.) Thus, if the user really cares about duplicates, then he or she needs to be extremely careful in formulating the query in such a way as to obtain exactly the desired result.

Furthermore, analogous remarks apply to the system itself: Because different formulations can produce different results, the optimizer too has to be extremely careful in its task of expression transformation. For example, the optimizer isn’t free to transform, say, formulation 1 into formulation 12 or the other way around, even if it would like to. In other words, duplicate rows act as a significant optimization inhibitor. Here are some implications of this fact:

  • The optimizer code itself is harder to write, harder to maintain, and probably more buggy—all of which combine to make the product more expensive and less reliable, as well as later in delivery to the marketplace, than it might be.

  • System performance is likely to be worse than it might be.

  • Users are going to have to get involved in performance issues. To be more specific, they’re going to have to spend time and effort in figuring out how to formulate a given query in order to get the best performance—a state of affairs that (as noted in Chapter 1) the relational model was expressly intended to avoid.

The fact that duplicates serve as an optimization inhibitor is particularly frustrating in view of the fact that, in most cases, users probably don’t care how many duplicates appear in the result. In other words:

  • Different formulations produce different results.

  • However, the differences are probably irrelevant from the user’s point of view.

  • But the optimizer is unaware of this latter fact and is therefore prevented, unnecessarily, from performing the transformations it might like to perform.

On the basis of examples like the foregoing, I’m tempted to say you should always ensure that query results contain no duplicates—for example, by always specifying DISTINCT in your SQL queries—and thus simply forget about the whole problem (and if you follow this advice, there can be no good reason for having duplicates in the first place!). However, I’ll have more to say about this suggestion in the section immediately following.



[50] One reviewer felt strongly that an even more powerful practical argument (in fact, the most practical argument of all) is simply that duplicates don’t match reality—a database that permits duplicates just hasn’t been designed properly and can’t be, as I put it in Chapter 1, “a faithful model of reality.” I’m very sympathetic to this position. But this book isn’t about database design, and duplicates aren’t just a design issue in any case. Thus, what I’m trying to do here is show the problems duplicates can cause, regardless of whether they’re due to bad design. A detailed analysis of this whole issue, design aspects included, can be found in the paper “Double Trouble, Double Trouble” (see Appendix G).

[51] Here’s as good a place as any to stress the point that—contrary to common commercial practice, perhaps— my use of the unqualified term “optimization” (and related terms) in this book always refers to something the DBMS is responsible for, not something the user has to do. In other words, I’m not talking about what’s sometimes called “hand optimization.”

[52] I once quoted this line in a seminar, and an attendee said “You can say that again!” To which I replied “Yes—there’s a logical difference between logic and rhetoric.”

[53] I note in passing, therefore, that duplicates violate one of the original objectives of the relational model, which was explicitness—the meaning of the data should be as obvious and explicit as possible, since databases are supposed to be suitable for sharing among a variety of disparate users. The presence of duplicates strongly suggests that part of that meaning is hidden instead of being explicit. In fact, duplicates violate one of the most fundamental relational principles of all: viz., The Information Principle (discussed further in Appendix A).

Get SQL and Relational Theory, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.