**Chapter 4**

**No Duplicates, No Nulls**

*I haven’t even mentioned yet the way the silly notions*

*Discussed so far interreact and lead us into oceans*

*Of complication and despond and general distress.*

*Are two nulls equal (duplicates)? I fear, both NO and YES.*

—Anon.: *Where Bugs Go*

In the previous chapter, I said the following (approximately):

Relations never contain duplicate tuples, because the body of a relation is a set (a set of tuples) and sets in mathematics don’t contain duplicate elements.

Relations never contain nulls, because the body of a relation is a set of tuples, and tuples in turn never contain nulls.

I also suggested that since there was so much to be said about these topics, it was better to devote a separate chapter to them. This is that chapter. *Note:* By definition, the topics in question are SQL topics, not relational ones; in this chapter, therefore, I’ll use the terminology of SQL rather than that of the relational model (for the most part, at any rate).

**WHAT’S WRONG WITH DUPLICATES?**

There are numerous practical arguments in support of the position that duplicate rows (“duplicates” for short) should be prohibited. Here I want to emphasize just one—but I think it’s a powerful one.^{1} However, it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

I assume you know that relational DBMSs include a component called the

*optimizer*,^{2}whose job is to try to figure out the best way to implement user queries and the like (where “best” basically means*best performing*).I assume you also know that one of the things optimizers do is what’s sometimes called

*query rewrite*. Query rewrite is the process of transforming some relational expression*exp1*(representing some user query, say) into another such expression*exp2*, such that*exp1*and*exp2*are guaranteed to produce the same result when evaluated but*exp2*performs better than*exp1*(at least, we hope so).*Note:*Be aware, however, that the term*query rewrite*is also used in certain commercial products with a different, typically more limited meaning.

Now I can present my argument. The fundamental point I want to make is that certain expression transformations, and hence certain optimizations, that would be valid if SQL were truly relational aren’t valid if duplicate rows are allowed. By way of example, consider the (nonrelational) database shown in Fig. 4.1. Note right away that the tables in that database have no keys (and hence no primary keys a fortiori, which is why there’s no double underlining in the figure). And by the way: If you’re thinking the database is totally unrealistic—and especially if you’re thinking that because of that fact you’re not going to be convinced by the arguments that follow—I politely request that you suspend judgment until you’ve seen the further discussion of this example at the beginning of the next section, “Duplicates: Further Issues.”

Fig. 4.1: A nonrelational database, with duplicates

Before going any further, perhaps I should ask the question: What does it mean to have three (P1,Screw) rows in table P and not two, or four, or seventeen? It must mean something, for if it means nothing, then why are the duplicates there in the first place? As I once heard Ted Codd say: If something is true, saying it twice doesn’t make it any more true.^{3}

So I have to assume there’s some meaning attached to the duplication, even though that meaning, whatever it is, is hardly very explicit. Given that duplicates do have some meaning, therefore, there are presumably going to be business decisions made on the basis of the fact that, for example, there are three (P1,Screw) rows in table P and not two or four or seventeen. For if not, then (to repeat) why are the duplicates there in the first place?

*Aside:* In fact the foregoing paragraph touches on another point: namely, that duplicates violate one of the original objectives of the relational model. The objective in question is *explicitness*; that is, the meaning of the data in the database should be as explicit and obvious as possible (since databases are supposed to be suitable for sharing among a wide variety of disparate users and applications). As we’ve just seen, however, the presence of duplicates strongly suggests that part of the meaning of that data is not explicit but hidden. In fact, duplicates can be regarded as violating one of the most fundamental relational principles of all: viz., *The Information Principle* (to be discussed in Appendix A). *End of aside*.

Now consider the following query on the database of Fig. 4.1: “Get part numbers for parts that either are screws or are supplied by supplier S1, or both.” Here are some candidate SQL formulations for this query, together with the result produced in each case:

SELECT P.PNO

FROM P

WHERE P.PNAME = 'Screw'

OR P.PNO IN

( SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1' )Result: P1 * 3, P2 * 1.

SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1'

OR SP.PNO IN

( SELECT P.PNO

FROM P

WHERE P.PNAME = 'Screw' )Result: P1 * 2, P2 * 1.

SELECT P.PNO

FROM P , SP

WHERE ( SP.SNO = 'S1' AND

SP.PNO = P.PNO )

OR P.PNAME = 'Screw'Result: P1 * 9, P2 * 3.

SELECT SP.PNO

FROM P , SP

WHERE ( SP.SNO = 'S1' AND

SP.PNO = P.PNO )

OR P.PNAME = 'Screw'Result: P1 * 8, P2 * 4.

SELECT P.PNO

FROM P

WHERE P.PNAME = 'Screw'

UNION ALL

SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1'Result: P1 * 5, P2 * 2.

SELECT DISTINCT P.PNO

FROM P

WHERE P.PNAME = 'Screw'

UNION ALL

SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1'Result: P1 * 3, P2 * 2.

SELECT P.PNO

FROM P

WHERE P.PNAME = 'Screw'

UNION ALL

SELECT DISTINCT SP.PNO

FROM SP

WHERE SP.SNO = 'S1'Result: P1 * 4, P2 * 2.

SELECT DISTINCT P.PNO

FROM P

WHERE P.PNAME = 'Screw'

OR P.PNO IN

( SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1' )Result: P1 * 1, P2 * 1.

SELECT DISTINCT SP.PNO

FROM SP

WHERE SP.SNO = 'S1'

OR SP.PNO IN

( SELECT P.PNO

FROM P

WHERE P.PNAME = 'Screw' )Result: P1 * 1, P2 * 1.

SELECT P.PNO

FROM P

GROUP BY P.PNO , P.PNAME

HAVING P.PNAME = 'Screw'

OR P.PNO IN

( SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1' )Result: P1 * 1, P2 * 1.

SELECT P.PNO

FROM P , SP

GROUP BY P.PNO , P.PNAME , SP.SNO , SP.PNO

HAVING ( SP.SNO = 'S1' AND

SP.PNO = P.PNO )

OR P.PNAME = 'Screw'Result: P1 * 2, P2 * 2.

SELECT P.PNO

FROM P

WHERE P.PNAME = 'Screw'

UNION

SELECT SP.PNO

FROM SP

WHERE SP.SNO = 'S1'Result: P1 * 1, P2 * 1.

*Aside:* Actually, certain of the foregoing formulations—which?—are a little suspect, because they effectively assume that every screw is supplied by at least one supplier. But this fact makes no material difference to the argument that follows. *End of aside.*

The first point to notice, then, is that the twelve different formulations produce nine different results: different, that is, with respect to their *degree of duplication*. (By the way, I make no claim that the twelve different formulations and the nine different results are the only ones possible; indeed, they aren’t, in general.) Thus, if the user really cares about duplicates, then he or she needs to be extremely careful in formulating the query in such a way as to obtain exactly the desired result.

Furthermore, analogous remarks apply to the system itself: Because different formulations can produce different results, the optimizer too has to be extremely careful in its task of expression transformation. For example, the optimizer isn’t free to transform, say, formulation 1 into formulation 12 or the other way around, even if it would like to. In other words, duplicate rows act as a significant *optimization inhibitor*. Here are some implications of this fact:

The optimizer code itself is harder to write, harder to maintain, and probably more buggy— all of which combine to make the product more expensive and less reliable, as well as later in delivery to the marketplace, than it might be.

System performance is likely to be worse than it might be.

Users are going to have to get involved in performance issues. To be more specific, they’re going to have to spend time and effort in figuring out how to formulate a given query in order to get the best performance—a state of affairs that (as noted in Chapter 1) the relational model was expressly intended to avoid.

The fact that duplicates serve as an optimization inhibitor is particularly frustrating in view of the fact that, in most cases, users probably *don’t* care how many duplicates appear in the result. In other words: Different formulations produce different results; however, the differences are probably irrelevant from the user’s point of view; but the optimizer is unaware of this latter fact and is therefore prevented, unnecessarily, from performing the transformations it might like to perform.

**DUPLICATES: FURTHER ISSUES**

There’s much, much more that could be said regarding duplicates and what’s wrong with them, but in this section I’ll limit myself to just three further points. The first has to do with the fact that in practice (as I mentioned earlier) base tables, at least, almost never do contain duplicate rows, and hence that the example in the previous section might reasonably be regarded as unrealistic. Well, all right; but the trouble is, SQL can *generate* duplicates in query results. Indeed, different formulations of “the same” query can produce results with different degrees of duplication, even if the input tables themselves have no duplicates at all. By way of illustration, let’s see what happens to that example from the previous section if we revise the database to make the base tables duplicate free, as in Fig. 4.2 (thanks to a reader of the previous edition, Ed Hynes, for drawing this example to my attention):

Fig. 4.2: A relational database, without duplicates

Now, in the previous section I showed twelve different formulations of the query “Get part numbers for parts that either are screws or are supplied by supplier S1, or both” against the database of Fig. 4.1. Well, here are the results produced by those same twelve formulations against the revised version of the database in Fig. 4.2 (*Exercise:* Check these!):

P1 * 3, P2 * 1

P1 * 2, P2 * 1

P1 * 5, P2 * 3

P1 * 6, P2 * 2

P1 * 2, P2 * 2

P1 * 2, P2 * 2

P1 * 2, P2 * 2

P1 * 1, P2 * 1

P1 * 1, P2 * 1

P1 * 3, P2 * 1

P1 * 5, P2 * 3

P1 * 1, P2 * 1

As you can see, the twelve formulations still produce several different results (results, that is, that differ with respect to their degree of duplication). As I claimed above, therefore, it’s clear that even if the input tables themselves don’t contain any duplicates, different formulations of the same query can produce results with different degrees of duplication, and optimization is thus still inhibited. So the message is: Making sure that base tables never contain any duplicate rows is *necessary but not sufficient* to avoid duplicate rows entirely.

At the risk of beating a dead horse, I’d like to pursue this point just a moment longer and consider a much simpler example (I didn’t lead with this example because it’s almost too simple, a fact that can make it easy to miss the real significance of what’s going on). Here are two possible formulations of the query “Get supplier numbers for suppliers who supply at least one part” on our usual suppliers-and-parts database (and note that this time the input tables most definitely don’t contain any duplicates):

SELECT SNO │ SELECT SNO

FROM S │ FROM S NATURAL JOIN SP

WHERE SNO IN │

( SELECT SNO │

FROM SP ) │

At least one of these expressions—which?—will produce a result with duplicates, in general. (*Exercise:* Given our usual sample data values, what results do the two expressions produce?)

So what do we conclude from examples like the ones above and the one discussed in the previous section? Well, what I’d *like* to conclude is that you should abide by the following suggestions (and if you do, you can then just forget about the duplicates problem entirely):

First, never allow duplicates in base tables (by always specifying at least one key—see Chapter 5).

Second, ensure that query results never contain duplicates (for example, by always specifying DISTINCT in your SQL queries).

Unfortunately, however, life is never quite as simple as we might like, and the second of these suggestions, at least, needs more discussion and explanation. But let me leave it at that for now; I’ll come back and revisit it in the next section (“Avoiding Duplicates in SQL”).

I turn now to my second point. The fact is, there’s another at least psychological argument against duplicates that I think is quite persuasive (thanks to Jonathan Gennick for this one): If, in accordance with the *n*-dimensional perspective on relations discussed in Chapter 3, you think of a table as a plot of points in some *n*-dimensional space, then duplicate rows clearly don’t add anything—they simply amount to plotting the same point twice.

My final point is this. Suppose table *T* does permit duplicates. Then we can’t tell the difference between “genuine” duplicates in *T* and duplicates that arise from errors in data entry on *T!* For example, suppose the person responsible for data entry unintentionally enters the very same row twice—e.g., by inadvertently hitting the return key twice (easily done, by the way). Then there’s no straightforward way to delete the “second” row without deleting the “first” as well. Note that we presumably do want to delete that “second” row, since it shouldn’t have been entered in the first place.

**AVOIDING DUPLICATES IN SQL**

The relational model prohibits duplicates; to use SQL relationally, therefore, steps must be taken to prevent them from occurring. Now, if every base table has at least one key (see Chapter 5), then duplicates will never occur in base tables as such. As we’ve seen, however, certain SQL expressions can still yield result tables with duplicates. Here are some of the cases in which such tables can be produced:

SELECT ALL

UNION ALL

VALUES (i.e., table value constructor invocations)

Regarding VALUES, see Chapter 3. Regarding ALL, note first that this keyword (and its alternative, DISTINCT) can be specified:

In a SELECT clause, immediately following the SELECT keyword

In a union, intersection, or difference, immediately following the applicable keyword (UNION, INTERSECT, and EXCEPT, respectively)

Inside the parentheses in an invocation of a “set function” such as SUM, immediately preceding the argument expression

^{4}

*Note:* DISTINCT is the default for UNION, INTERSECT, and EXCEPT; ALL is the default in the other cases.

Now, the “set function” case is special; you must specify ALL, at least implicitly, if you want the function to take duplicate values into account, which sometimes you do (see Chapter 7). But the other cases have to do with elimination of duplicate rows, which must always be done, at least in principle, if you want to use SQL relationally. Thus, the obvious recommendations in those cases are: Always specify DISTINCT; preferably do so explicitly; and never specify ALL. Then you can just forget about duplicate rows entirely.

In practice, however (and as previously noted), matters aren’t quite that simple. Why not? Well, I don’t think I can do better here than repeat the essence of what I wrote in this book’s predecessor (*Database in Depth*, O’Reilly Media Inc., 2005):

At this point in the original draft, I added that if you find the discipline of always specifying DISTINCT annoying, don’t complain to me—complain to the SQL vendors instead. But my reviewers reacted with almost unanimous horror to my suggestion that you should always specify DISTINCT. One wrote: “Those who really know SQL well will be shocked at the thought of coding SELECT DISTINCT by default.” Well, I’d like to suggest, politely, that (a) those who are “shocked at the thought” probably know the implementations well, not SQL, and (b) their shock is probably due to their recognition that those implementations do such a poor job of optimizing away unnecessary DISTINCTs.^{5} If I write SELECT DISTINCT SNO FROM S ..., that DISTINCT can safely be ignored. If I write either EXISTS (SELECT DISTINCT ...) or IN (SELECT DISTINCT ...), those DISTINCTs can safely be ignored. If I write SELECT DISTINCT SNO FROM SP ... GROUP BY SNO, that DISTINCT can safely be ignored. If I write SELECT DISTINCT ... UNION SELECT DISTINCT ..., those DISTINCTs can safely be ignored. And so on. Why should I, as a user, have to devote time and effort to figuring out whether some DISTINCT is going to be a performance hit and whether it’s logically safe to omit it?—and to remembering all of the details of SQL’s inconsistent rules for when duplicates are automatically eliminated and when they’re not?

Well, I could go on. However, I decided—against my own better judgment, but in the interest of maintaining good relations (with my reviewers, I mean)—not to follow my own advice elsewhere in this book but only to request duplicate elimination explicitly when it seemed to be logically necessary to do so. It wasn’t always easy to decide when that was, either. But at least now I can add my voice to those complaining to the vendors, I suppose.

So the **recommendation** (sadly) boils down to this: First, make sure you know when SQL eliminates duplicates without you asking it to. Second, in those cases where you do have to ask, make sure you know whether it matters if you don’t. Third, in those cases where it matters, specify DISTINCT (but, as Hugh Darwen once said, be annoyed about it). And never specify ALL!

**WHAT’S WRONG WITH NULLS?**

The opening paragraph from the section “What’s Wrong with Duplicates?” applies equally well here, with just one tiny text substitution, so I’ll basically just repeat it: There are numerous practical arguments in support of the position that nulls should be prohibited. Here I want to emphasize just one—but I think it’s a powerful one. But it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

I assume you know that any comparison in which at least one of the comparands is null evaluates to the UNKNOWN truth value instead of TRUE or FALSE. The justification for this state of affairs is the intended interpretation of null as

*value unknown*: If the value of*A*is unknown, then it’s also unknown whether, for example,*A*>*B*, regardless of the value of*B*(even—perhaps especially—if the value of*B*is unknown as well).*Note:*That same state of affairs is also the source of the term*three-valued logic*(3VL). That is, the notion of nulls, as understood in SQL, inevitably leads to a logic in which there are three truth values instead of the usual two. (The relational model, by contrast, is based on conventional two-valued logic, 2VL.)I assume you’re also familiar with the 3VL truth tables for the familiar logical operators— also known as

*connectives*—NOT, AND, and OR (using T, U, and F to stand for TRUE, UNKNOWN, and FALSE, respectively):p │ NOT p p q │ p AND q p q │ p OR q

──┼─────── ─────┼───────── ─────┼────────

T │ F T T │ T T T │ T

U │ U T U │ U T U │ T

F │ T T F │ F T F │ T

U T │ U U T │ T

U U │ U U U │ U

U F │ F U F │ U

F T │ F F T │ T

F U │ F F U │ U

F F │ F F F │ FObserve in particular that NOT returns UNKNOWN if its input is UNKNOWN; AND returns UNKNOWN if one input is UNKNOWN and the other is either UNKNOWN or TRUE; and OR returns UNKNOWN if one input is UNKNOWN and the other is either UNKNOWN or FALSE.

Now I can present my argument. The fundamental point I want to make is that certain boolean expressions—and therefore certain queries in particular—can produce results that are correct according to three-valued logic but not correct in the real world. By way of example, consider the (nonrelational) database shown in Fig. 4.3, in which “the CITY is null” for part P1. Note carefully that the shading in that figure, in the place where the CITY value for part P1 ought to be, stands for *nothing at all*; conceptually, there’s *nothing at all*—not even a string of blanks or an empty string—in that position (which means the “tuple” for part P1 isn’t really a tuple, a point I’ll come back to near the end of this section).

Fig. 4.3: A nonrelational database, with a null

Consider now the following (admittedly rather contrived) query on the database of Fig. 4.3: “Get (SNO,PNO) pairs where either the supplier and part cities are different or the part city isn’t Paris, or both.” Here’s the obvious SQL formulation of this query:

SELECT S.SNO , P.PNO

FROM S , P

WHERE S.CITY <> P.CITY

OR P.CITY <> 'Paris'

Now I want to focus on the boolean expression in the WHERE clause:

( S.CITY <> P.CITY ) OR ( P.CITY <> 'Paris' )

(I’ve added some parentheses for clarity.) For the only data we have, this expression evaluates to UNKNOWN OR UNKNOWN, which reduces to just UNKNOWN. Now, queries in SQL retrieve data for which the expression in the WHERE clause evaluates to TRUE, not to FALSE and not to UNKNOWN;^{6} in the example, therefore, nothing is retrieved at all.

But part P1 does have some corresponding city in the real world;^{7} in other words, “the null CITY” for part P1 does stand for some real value, say *c*. Now, either *c* is Paris or it isn’t. If it is, then the expression

( S.CITY <> P.CITY ) OR ( P.CITY <> 'Paris' )

becomes (for the only data we have)

( 'London' <> 'Paris' ) OR ( 'Paris' <> 'Paris' )

which evaluates to TRUE, because the first term evaluates to TRUE. Alternatively, if *c* isn’t Paris, then the expression becomes (again, for the only data we have)

( 'London' <> c ) OR ( c <> 'Paris' )

which also evaluates to TRUE, because the second term evaluates to TRUE. Thus, the boolean expression is always true in the real world, and the query should therefore return the pair (S1,P1), *regardless of what real world value the null stands for*. In other words, the result that’s correct according to the logic (meaning, specifically, 3VL) and the result that’s correct in the real world are different!

By way of another example, consider the following query on that same table P from Fig. 4.3 (I didn’t lead with this example because it’s even more contrived than the previous one, but in some ways it makes the point with still more force):

SELECT PNO

FROM P

WHERE CITY = CITY

The real world answer here is surely the set of part numbers currently appearing in P (in other words, the set containing just part number P1, given the sample data shown in Fig. 4.3). SQL, however, will return no part numbers at all.

To sum up: If you have any nulls in your database, you’re getting wrong answers to certain of your queries. What’s more, you have no way of knowing, of course, just which queries you’re getting wrong answers to and which not; all results become suspect. *You can never trust the answers you get from a database with nulls.* In my opinion, this state of affairs is a complete showstopper.

*Aside:* To all of the above, I can’t resist adding that even though SQL does support 3VL, and even though it does support the keyword UNKNOWN, that keyword does *not*—unlike the keywords TRUE and FALSE—denote a value of type BOOLEAN, in SQL. (This is just one of the numerous flaws in SQL’s 3VL support; there are many, many others, but most of them are beyond the scope of this book.) To elaborate briefly: As with 2VL, the SQL type BOOLEAN contains just two values, TRUE and FALSE; “the third truth value” is represented, quite incorrectly, by null! Here are some consequences of this fact:

Assigning UNKNOWN to a variable B of type BOOLEAN actually sets B to null.

After such an assignment, the comparison B = UNKNOWN doesn’t give TRUE— instead, it gives null (meaning, to spell the point out, that SQL apparently believes, or claims, that it’s unknown whether B has the value UNKNOWN). Note, incidentally, that this state of affairs constitutes a clear violation of

*The Assignment Principle*(see Exercise 2.22 in Chapter 2, also Chapter 5).In fact, the comparison B = UNKNOWN

*always*gives null (meaning UNKNOWN), regardless of the value of B, because it’s logically equivalent to the comparison “B = NULL” (not meant to be valid SQL syntax).

To understand the seriousness of such flaws, you might care to meditate on the analogy of a numeric type using null instead of zero to represent zero. *End of aside.*

As with the business of duplicates earlier, there’s a lot more that could be said on the whole issue of nulls, but I just want to close with a brief look at the *formal* argument against them. Recall that, by definition, a null isn’t a value. It follows that:

A “type” that contains a null isn’t a type (because types contain values).

A “tuple” that contains a null isn’t a tuple (because tuples contain values).

A “relation” that contains a null isn’t a relation (because relations contain tuples, and tuples don’t contain nulls).

In fact, nulls (like duplicates) violate one of the most fundamental relational principles of all—viz.,

*The Information Principle*. Once again, see Appendix A for further discussion of that principle.

The net of all this is that if nulls are present, then we’re certainly not talking about the relational model (I don’t know what we are talking about, but it’s not the relational model); the entire edifice crumbles, and *all bets are off*.

**AVOIDING NULLS IN SQL**

The relational model prohibits nulls; to use SQL relationally, therefore, steps must be taken to prevent them from occurring. First of all, a NOT NULL constraint should be specified, either explicitly or implicitly, for every column in every base table (see Chapter 5); then nulls will never occur in base tables as such. Unfortunately, however, certain SQL expressions can still yield result tables containing nulls. Here are some of the situations in which nulls can be produced:

The SQL “set functions” SUM, MAX, MIN (etc.) all return null if their argument is empty (except for COUNT and COUNT(*), which correctly return zero in such a situation).

If a scalar subquery evaluates to an empty table, that empty table is coerced to a null.

If a row subquery evaluates to an empty table, that empty table is coerced to a row of all nulls.

*Note:*A row of all nulls and a null row aren’t the same thing at all, logically speaking (another logical difference here, in fact)—yet SQL does think they’re the same thing, at least some of the time. But it would take us much too far afield to get into the detailed implications of*that*state of affairs here.Outer joins and “union joins” are expressly designed to produce nulls in their result.

^{8}If the ELSE option is omitted from a CASE expression, an ELSE option of the form ELSE NULL is assumed.

The expression NULLIF(

*x*,*y*) returns null if*x*=*y*evaluates to TRUE.The “referential triggered actions” ON DELETE SET NULL and ON UPDATE SET NULL can both generate nulls (obviously enough).

**Strong recommendations:**

Do specify NOT NULL, at least implicitly, for every column in every base table.

Don’t use the keyword NULL in any other context whatsoever (i.e., anywhere other than a NOT NULL constraint or logical equivalent).

Don’t use the keyword UNKNOWN in any context whatsoever.

Don’t omit the ELSE option from a CASE expression unless you’re certain it would never have been reached anyway.

Don’t use NULLIF.

Don’t use outer join, and don’t use the keywords OUTER, FULL, LEFT, and RIGHT (except possibly as suggested in the section “A Remark on Outer Join” below).

Don’t use union join.

Don’t specify either PARTIAL or FULL on MATCH (they have meaning only when nulls are present). For similar reasons, don’t use the MATCH option on foreign key constraints, and don’t use IS [NOT] DISTINCT FROM. (If

*a*and*b*are both nonnull, then*a*IS NOT DISTINCT FROM*b*reduces to*a*=*b*, and*a*IS DISTINCT FROM*b*reduces to*a*<>*b*.)Don’t use IS TRUE, IS NOT TRUE, IS FALSE, or IS NOT FALSE. The reason is that, if

*bx*is a boolean expression, then the following logical equivalences fail to hold only if nulls are present (the symbol “≡” means*is equivalent to*):bx IS TRUE ≡ bx

bx IS NOT TRUE ≡ NOT bx

bx IS FALSE ≡ NOT bx

bx IS NOT FALSE ≡ bxIn other words, IS TRUE and the rest are distractions at best, in the absence of nulls.

Finally, do use COALESCE on every scalar expression that might “evaluate to null” without it. (Apologies for the quotation marks here, but the fact is that the phrase “evaluates to null” is a solecism.)

*Note:* In case you’re not familiar with COALESCE, let me elaborate briefly on the last of these recommendations. Essentially, COALESCE is an operator that lets you replace a null by some nonnull value “as soon as it appears” (i.e., before it has a chance to do any significant damage). Here’s the definition: Let *a*, *b*, ..., *c* be scalar expressions. Then the expression COALESCE (*a*,*b*,...,*c*) returns null if its arguments are all null, or the value of its first nonnull argument otherwise. Of course, to use it sensibly, you do need to make sure at least one of *a*, *b*, ..., *c* is nonnull! Here’s a fairly realistic example:

SELECT S.SNO , ( SELECT COALESCE ( SUM ( ALL SP.QTY ) , 0 )

FROM SP

WHERE SP.SNO = S.SNO ) AS TOTQ

FROM S

In this example, if the SUM invocation “evaluates to null”—which it will do in particular for any supplier that doesn’t have any matching shipments, like supplier S5 in our usual running example—then the COALESCE invocation will replace that null by a zero. (Incidentally, this example also illustrates a situation in which use of ALL instead of DISTINCT isn’t just acceptable but is logically required, though it might be implicit. See Chapter 7.) Given our usual sample data, the query thus produces the following result:

┌─────┬──────┐

│ SNO │ TOTQ │

├═════┼──────┤

│ S1 │ 1300 │

│ S2 │ 700 │

│ S3 │ 200 │

│ S4 │ 900 │

│ S5 │ 0 │

└─────┴──────┘

**A REMARK ON OUTER JOIN**

Outer join is expressly designed to produce nulls in its result and should therefore be avoided, in general. Relationally speaking, it’s a kind of shotgun marriage: It forces two tables into a kind of union—yes, I do mean union, not join—even when the tables in question fail to conform to the usual requirements for union (see Chapter 6). It does this, in effect, by padding one or both of the tables with nulls before doing the union, thereby making them conform to those usual requirements after all. But there’s no reason why that padding shouldn’t be done with proper values instead of nulls, as in this example:

SELECT SNO , PNO FROM SP

UNION

SELECT SNO , 'nil' AS PNO FROM S

WHERE SNO NOT IN ( SELECT SNO FROM SP )

Result (note the row for supplier S5 in particular):

┌─────┬─────┐

│ SNO │ PNO │

├═════┼═════┤

│ S1 │ P1 │

│ S1 │ P2 │

│ S1 │ P3 │

│ S1 │ P4 │

│ S1 │ P5 │

│ S1 │ P6 │

│ S2 │ P1 │

│ S2 │ P2 │

│ S3 │ P2 │

│ S4 │ P2 │

│ S4 │ P4 │

│ S4 │ P5 │

│ S5 │ nil │

└─────┴─────┘

Alternatively, the same result could be obtained by using the explicit SQL outer join operator in conjunction with COALESCE, as here:

SELECT SNO , COALESCE ( PNO , 'nil' ) AS PNO

FROM S NATURAL LEFT OUTER JOIN SP

*Note:* I said “there’s no reason” why the padding shouldn’t be done with proper values, as in the foregoing example, but that phraseology is really a little too glib. The example worked— sort of—because attribute PNO is of type CHAR, and so we could pad with a character string value ('nil' in the example). But what if it had been of some numeric type, say INTEGER? Or, worse, some user defined type? And even in the simple character string case, an argument could be made that the result misrepresents the semantics of the situation (does 'nil' truly represent a part number?). The truth is, padding with a real value instead of null just tends to hide the fact that outer join, no matter what code is used to achieve it, is simply not a respectable operation.^{9} Much better to avoid it altogether.

**CONCLUDING REMARKS**

There are a few final remarks I want to make regarding nulls and 3VL specifically. Nulls and 3VL are supposed to be a solution to the “missing information” problem—but I believe I’ve shown that, to the extent they can be considered a “solution” at all, they’re a disastrously bad one. Before I leave the topic, however, I’d like to raise, and respond to, an argument that’s often heard in this connection. That argument goes something like this:

All of those examples you give where nulls lead to wrong answers are very artificial. Real world queries aren’t like that! More generally, most of your criticisms seem very academic and theoretical—I bet you can’t show any real practical situations where nulls have given rise to the kinds of problems you worry about, and I bet you can’t prove such practical situations do occur.

Needless to say, I have several responses to this argument. The first is: How do we know nulls *haven’t* caused real practical problems, anyway? It seems to me that if some serious real world situation—an oil spill, a collapsed bridge, a wrong medical diagnosis—were found to be due to nulls, there might be valid reasons (nontechnical ones, I mean) why the information would never get out. We’ve all heard stories of embarrassing failures caused by software glitches of other kinds, even in the absence of nulls; in my opinion, nulls can only serve to make such failures much more likely.

Second, suppose someone—me, for example—were to go around claiming that some software product or application contained a serious logical error due to nulls. Can you imagine the lawsuits?

Third and most important, I think those of us who criticize nulls don’t need to be defensive, anyway; I think we should stand those counterarguments on their head, as it were. After all, it’s undeniable that nulls can lead to errors in certain cases. So it’s not up to us to prove that those “certain cases” might include practical, real world situations; rather, it’s up to those who want to defend nulls to prove that they don’t. And I venture to suggest that in practice it would be quite difficult, and very likely impossible, to prove any such thing.

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are a little too complex, in general, to be discussed in detail here. The SQL mechanism of (nonnull) default values can be used in simple cases; but for a more comprehensive approach to the problem—including in particular an explanation of how you can still get “don’t know” answers when you want them, even from a database without nulls—I refer you to Appendix C.

**EXERCISES**

4.1 “Duplicates in databases are a good idea in because duplicates occur naturally in the real world. For example, all pennies are duplicates of one another.” How would you respond to this argument?

4.2 Let *r* be a relation and let *bx* and *by* be boolean expressions. Then there’s a law (used in relational systems to help with optimization, among other things) that states that (*r* WHERE *bx*) UNION (*r* WHERE *by*) ≡ *r* WHERE *bx* OR *by*. If *r* isn’t a relation but an SQL table with duplicates, does this law still apply?

4.3 Let *a*, *b*, and *c* be sets. Then *the distributive law of intersection over union* (also used in relational systems to help with optimization among other things) states that *a* INTERSECT (*b* UNION *c*) ≡ (*a* INTERSECT *b*) UNION (*a* INTERSECT *c*). If *a*, *b*, and *c* are bags instead of sets, does this law still apply?

4.4 Part of the explanation of the FROM clause (in a SELECT - FROM - WHERE expression) in the 1992 version of the standard read as follows:

[The] result of the <from clause> is the ... cartesian product of the tables identified by [the specifications in that <from clause>]. The ... cartesian product, *CP*, is the multiset of all rows *r* such that *r* is the concatenation of a row from each of the identified tables ...

Note, therefore, that *CP* isn’t well defined!—notwithstanding the fact that the standard did go on to say that “The cardinality of *CP* is the product of the cardinalities of the identified tables.” For example, let tables T1 and T2 be as shown here:

T1 T2

┌────┐ ┌────┐

│ C1 │ │ C2 │

├────┤ ├────┤

│ 0 │ │ 1 │

│ 0 │ │ 2 │

└────┘ └────┘

Observe now that all of the following fit the foregoing definition for “the” cartesian product *CP* of T1 and T2 (that is, any of them could be “the” multiset referred to):

CP1 CP2 CP3

┌────┬────┐ ┌────┬────┐ ┌────┬────┐

│ C1 │ C2 │ │ C1 │ C2 │ │ C1 │ C2 │

├────┼────┤ ├────┼────┤ ├────┼────┤

│ 0 │ 1 │ │ 0 │ 1 │ │ 0 │ 1 │

│ 0 │ 1 │ │ 0 │ 1 │ │ 0 │ 2 │

│ 0 │ 1 │ │ 0 │ 2 │ │ 0 │ 2 │

│ 0 │ 2 │ │ 0 │ 2 │ │ 0 │ 2 │

└────┴────┘ └────┴────┘ └────┴────┘

Can you fix up the wording of the standard appropriately?

4.5 Consider the following SQL cursor definition:

DECLARE X CURSOR FOR SELECT SNO , QTY FROM SP ;

Note that (a) cursor X permits updates, (b) the table visible through cursor X permits duplicates, but (c) the underlying table SP doesn’t (permit duplicates, that is). Now suppose the operation DELETE ... WHERE CURRENT OF X is executed. Then there’s no way, in general, of saying which specific row of table SP is deleted by that operation. How would you fix *this* problem?

4.6 Please write out one googol times: There’s no such thing as a duplicate. *Note:* A *googol* is one followed by 100 zeros (i.e., 10 to the hundredth power). A *googolplex* is one followed by a googol zeros (i.e., 10 to the “googolth” power).

4.7 Do you think nulls occur naturally in the real world?

4.8 There’s a logical difference between null and the third truth value: True or false? (Perhaps I should ask: True, false, or unknown?)

4.9 In the body of the chapter, I gave truth tables for one monadic 3VL connective (NOT) and two dyadic 3VL connectives (AND and OR), but there are many other connectives as well (see Exercise 4.10 below). Another useful monadic connective is MAYBE,^{10} with truth table as follows:

P │ MAYBE p

──┼─────────

T │ F

U │ T

F │ F

Does SQL support this connective?

4.10 Following on from the previous exercise, how many distinct connectives are there altogether in 2VL? What about 3VL? What do you conclude from your answers to these questions?

4.11 A logic is *truth functionally complete* if it supports, directly or indirectly, all possible connectives. Truth functional completeness is an extremely important property; a logic without it would be like an arithmetic without support for certain operations, say “+”. Is classical 2VL truth functionally complete? What about SQL’s 3VL?

4.12 Let *bx* be a boolean expression. Then *bx* OR NOT *bx* is also a boolean expression, and in 2VL it’s guaranteed to evaluate to TRUE (it’s an example of what logicians call a *tautology*). Is it a tautology in 3VL? If not, is there an analogous tautology in 3VL?

4.13 With *bx* as in the previous exercise, *bx* AND NOT *bx* is also a boolean expression, and in 2VL it’s guaranteed to evaluate to FALSE (it’s an example of what logicians call a *contradiction*). Is it a contradiction in 3VL? If not, is there an analogous contradiction in 3VL?

4.14 In 2VL, *r* JOIN *r* is equal to *r* and INTERSECT and TIMES are both special cases of JOIN (see Chapter 6). Are these observations still valid in 3VL?

4.15 The following is a legitimate SQL row value constructor invocation: ROW (1,NULL). Is the row it denotes null or nonnull?

4.16 Let *bx* be an SQL boolean expression. Then NOT *bx* and *bx* IS NOT TRUE are both SQL boolean expressions. Are they equivalent?

4.17 Let *x* be an SQL expression. Then *x* IS NOT NULL and NOT (*x* IS NULL) are both SQL boolean expressions. Are they equivalent?

4.18 Let DEPT and EMP be SQL tables; let DNO be a column in both; let ENO be a column in EMP; and consider the expression DEPT.DNO = EMP.DNO AND EMP.DNO = 'D1' (this expression might be part of the WHERE clause in some query, for example). Now, a “good” optimizer might very well transform this expression into DEPT.DNO = EMP.DNO AND EMP.DNO = 'D1’ AND DEPT.DNO = 'D1', on the grounds that *a* = *b* and *b* = *c* together imply that *a* = *c* (see Exercise 6.13 in Chapter 6). But is this transformation valid? If not, why not? What are the implications?

4.19 Suppose the suppliers-and-parts database permits nulls; in particular, suppose columns SP.SNO and SP.PNO permit nulls.^{11} Here then is a query on that database, expressed for reasons beyond the scope of this chapter not in SQL but in a kind of pidgin form of relational calculus (see Chapter 10):

S WHERE NOT EXISTS SP ( SP.SNO = S.SNO AND SP.PNO = 'P2' )

What does this query mean? Is the following formulation equivalent?

S WHERE NOT ( S.SNO IN ( SP.SNO WHERE SP.PNO = 'P2' ) )

4.20 Let *k1* and *k2* be values of the same type. In SQL, then, what exactly does each of the following mean?

*k1*and*k2*are “the same” for the purposes of a comparison in, e.g., a WHERE clause.*k1*and*k2*are “the same” for the purposes of key uniqueness.*k1*and*k2*are “the same” for the purposes of duplicate elimination.

4.21 In the body of the chapter, I said UNION ALL can generate duplicates. But what about INTERSECT ALL and EXCEPT ALL?

4.22 Are the recommendations “Always specify DISTINCT” and “Never specify ALL” duplicates of each other?

4.23 If TABLE_DEE corresponds to TRUE (or *yes*) and TABLE_DUM to FALSE (or *no*), then what corresponds to UNKNOWN (or *maybe*)?

4.24 The following quotes are taken from the SQL standard:^{12}

“The data type boolean comprises the distinct truth values True and False. Unless prohibited by a NOT NULL constraint, the boolean data type also supports the truth value Unknown as the null value. This [standard] does not make a distinction between the null value of the boolean data type and the truth value Unknown ... [They] may be used interchangeably to mean exactly the same thing.”

“All boolean values and SQL truth values are comparable ... The value True is greater than the value False, and any comparison involving the null value or an Unknown truth value will return an Unknown result.”

Do you have any comments on these quotes? In particular, which of the following (if any) do you think are legal SQL expressions? And what do they return, if they’re legal?

TRUE OR FALSE

TRUE OR UNKNOWN

TRUE OR NULL

TRUE > FALSE

TRUE > UNKNOWN

TRUE > NULL

4.25 In his book *Using the New DB2* (Morgan Kaufmann, 1996), in a section titled “A Brief History of SQL,” Don Chamberlin—who is widely acknowledged to be “the father of SQL”— has the following to say (I’m quoting the text more or less verbatim, except that I’ve added some italics):

During the early development of SQL ... some decisions were made that were ultimately to generate a great deal [of] controversy ... Chief among these were the decisions to support null values [*sic*] and to permit duplicate rows ... I will [briefly examine] the reasons for these decisions ... My purpose here is historical rather than persuasive ... *I recognize that nulls and duplicates are religious topics*, and I do not expect anyone to have a conversion experience after reading this chapter.

Do you agree with Chamberlin that nulls and duplicates are “religious topics”?

**ANSWERS**

4.1 To deal with this argument properly would take more space than we have here, but it all boils down to what’s sometimes called *The Principle of Identity of Indiscernibles* (see Appendix A). Let *a* and *b* be any two entities—for example, two pennies. Well, if there’s absolutely no way whatsoever of distinguishing between *a* and *b*, then there aren’t two entities but only one!^{13} Now, it might be true for certain purposes that the two entities can be *interchanged*, but that fact isn’t sufficient to make them indiscernible. (Indeed, there’s a logical difference between interchangeability and indiscernibility, and arguments to the effect that “duplicates occur naturally in the real world” tend to be based on a muddle over this difference.) A detailed analysis of this whole issue can be found in the paper “Double Trouble, Double Trouble” (see Appendix G).

4.2 Before we can answer this question, we need to pin down exactly what WHERE and UNION mean in the presence of duplicates. The paper “The Theory of Bags: An Investigative Tutorial” (see Appendix G) goes into details on such matters; here let me just say that if we adopt the SQL definitions, then the law certainly doesn’t apply. In fact, it doesn’t apply to either UNION ALL or UNION DISTINCT! By way of example, let *T* be an SQL table with just one column—*C*, say—containing just two rows, each of them containing just the value *v*. Then the following expressions produce the indicated results:

SELECT C

FROM T

WHERE TRUE

OR TRUE

Result: *v* * 2.

SELECT C

FROM T

WHERE TRUE

UNION DISTINCT

SELECT C

FROM T

WHERE TRUE

Result: *v* * 1.

SELECT C

FROM T

WHERE TRUE

UNION ALL

SELECT C

FROM T

WHERE TRUE

Result: *v* * 4.

*Note:* If the various (implicit or explicit) ALLs in the foregoing expressions were all replaced by DISTINCT, it would be a different story. What do you conclude?

4.3 Remarks similar to those in the answer to the previous exercise apply here also. Again I’ll skip the details; I’ll just say for the record that, first, the answer depends, of course, on what definitions we adopt for UNION and INTERSECT for bags as opposed to sets; second, with the SQL definitions, the law *doesn’t* apply. I’ll leave development of a counterexample to you.

4.4 As far as I can see, the only way to resolve the ambiguity is by effectively defining a mapping from each of the (multiset) argument tables to a proper set, and likewise defining a mapping of the (multiset) result table—i.e., the desired cartesian product—to a proper set. (The mappings involve attaching a unique identifier to each row.) It seems to me, in fact, that the standard’s failed attempt at a definition here serves only to emphasize the point that one of the most fundamental concepts in the entire SQL language—viz., the idea that tables should permit duplicate rows—is fundamentally flawed, and cannot be repaired without, in effect, dispensing with the concept altogether.

*Note:* As I’m sure you observed, the quoted text was taken from the 1992 version of the standard. Later versions use different wording, and they do manage to fix the problem—but they do so, in effect, by adopting my solution of attaching a unique identifier to each row. I rest my case.

4.5 I don’t think this problem can be fixed.

4.6 *No answer provided!*

4.7 The question was: Do you think nulls occur naturally in the real world? Only you can answer this question—but if your answer is *yes*, I think you should examine your reasoning very carefully. For example, consider the statement, concerning some specific employee Joe, “Joe’s salary is $50,000.” That statement is either true or false. Now, you might not know whether it’s true or false; but your not knowing has nothing to do with whether it actually is true or false. In particular, your not knowing is certainly not the same as saying “Joe’s salary is null”! “Joe’s salary is $50,000” is a statement about the real world. “Joe’s salary is null” is a statement about your knowledge (or lack of knowledge, rather) about the real world. We certainly shouldn’t keep a mixture of these two very different kinds of statements in the same relation, or in the same relvar.

Suppose you had to represent the fact that you don’t know Joe’s salary in some box on some paper form. Would you enter a null, as such, into that form? I don’t think so! Rather, you would leave the box blank, or put a question mark, or write “unknown,” or something along those lines. And that blank, or question mark, or “unknown”—or whatever—is a value, not a null (recall that the one thing we can be definite about regarding nulls is that they aren’t values). Speaking for myself, therefore, no, I don’t think nulls do “occur naturally in the real world.”

4.8 True (though not in SQL!). Null is a marker that represents the absence of information, while UNKNOWN is a value, just as TRUE and FALSE are values. So there’s a logical difference between the two, and to confuse them as SQL does is a logical mistake. (I’d like to say it’s a big logical mistake, but all logical mistakes are big mistakes by definition.)

4.9 Yes, it does; SQL’s analog of “MAYBE *p*” is “*p* IS UNKNOWN”.

4.10 In 2VL there are exactly 4 monadic connectives and exactly 16 dyadic connectives, corresponding to the 4 possible monadic truth tables and 16 possible dyadic truth tables. Here are those truth tables (I’ve indicated the ones that have common names, such as NOT, AND, and OR):^{14}

│ │ NOT│ │

───┼─── ───┼─── ───┼─── ───┼──

T │ T T │ T T │ F T │ F

F │ T F │ F F │ T F │ F

│ T F IF │ T F NAND│ T F │ T F

───┼───── ───┼───── ───┼───── ───┼─────

T │ T T T │ T F T │ F T T │ F F

F │ T T F │ T T F │ T T F │ T T

OR │ T F │ T F XOR│ T F │ T F

───┼───── ───┼───── ───┼───── ───┼─────

T │ T T T │ T F T │ F T T │ F F

F │ T F F │ T F F │ T F F │ T F

│ T F IFF│ T F │ T F NOR│ T F

───┼───── ───┼───── ───┼───── ───┼─────

T │ T T T │ T F T │ F T T │ F F

F │ F T F │ F T F │ F T F │ F T

│ T F AND│ T F │ T F │ T F

───┼───── ───┼───── ───┼───── ───┼─────

T │ T T T │ T F T │ F T T │ F F

F │ F F F │ F F F │ F F F │ F F

In 3VL, by contrast, there are 27 (3 to the power 3) monadic connectives and 19,683 (3 to the power 3²) dyadic connectives. (In general, in fact, *n*VL has *n* to the power *n* monadic connectives and *n* to the power *n*² dyadic connectives.) Many conclusions might be drawn from these facts; one of the most immediate is that 3VL is vastly more complex than 2VL (much more so, probably, than most people, including in particular those who think nulls are a good thing, realize, or at least are prepared to admit to).

4.11 Classical 2VL supports (among other things) NOT, AND, and OR and is thus truth functionally complete, because all possible 2VL connectives can be expressed in terms of NOT and either AND or OR (see the answer to Exercise 10.4 in Chapter 10 for further explanation). And it turns out that SQL’s 3VL—under an extremely charitable interpretation of that term!—is truth functionally complete as well. The paper “Is SQL’s Three-Valued Logic Truth Functionally Complete?” (see Appendix G) discusses this issue in detail.

4.12 It’s not a tautology in 3VL, because if *bx* evaluates to UNKNOWN, the whole expression also evaluates to UNKNOWN. But there does exist an analogous tautology in 3VL: viz., *bx* OR NOT *bx* OR MAYBE *bx*. *Note:* This state of affairs explains why, in SQL, if you execute the query “Get all suppliers in London” and then the query “Get all suppliers not in London,” you don’t necessarily get (in combination) all suppliers; you have to execute the query “Get all suppliers who may be in London” as well. Note the implications for query rewrite; note too the potential for serious mistakes (on the part of both users and the system, I might add—and there’s some history here). To spell the point out: It’s very natural to assume that expressions that are tautologies in 2VL are also tautologies in 3VL, but such is not necessarily the case.

4.13 It’s not a contradiction in 3VL, because if *bx* evaluates to UNKNOWN, the whole expression also evaluates to UNKNOWN. But there does exist an analogous (slightly tricky!) contradiction in 3VL: viz., *bx* AND NOT *bx* AND NOT MAYBE *bx*. *Note:* As you might expect, this state of affairs has implications similar to those noted in the answer to the previous exercise.

4.14 In 3VL (at least as realized in SQL), *r* JOIN *r* isn’t necessarily equal to *r*, and INTERSECT isn’t a special case of JOIN. Why so? Because in SQL, believe it or not, two nulls don’t “compare equal” for join but do “compare equal” for intersection. (I take this state of affairs to be just another of the vast—infinite?—number of absurdities that nulls inevitably seem to lead us into.) However, TIMES is still a special case of JOIN, as it is in 2VL.

4.15 Here are the rules: Let *x* be an SQL row. Suppose for definiteness and simplicity that *x* has just two components, *x1* and *x2* (in left to right order, of course!). Then *x* IS NULL is defined to be equivalent to *x1* IS NULL AND *x2* IS NULL, and *x* IS NOT NULL is defined to be equivalent to *x1* IS NOT NULL AND *x2* IS NOT NULL. For the given row, both of these expressions evaluate to FALSE, and it follows that the row in question is neither null nor nonnull ... What do you conclude from this state of affairs?

By the way: At least one reviewer commented at this point that he’d never thought of a row being null. But rows are values (just as tuples and relations are values), and hence the idea of some row being unknown makes exactly as much sense as, say, the idea of some salary being unknown. Thus, if the concept of representing an unknown value by a “null” makes any sense at all—which of course I don’t think it does—then it surely applies to rows (and tables, and any other kind of value you can think of) just as much as it does to scalars. And as this exercise demonstrates, SQL tries to support this position—at least for rows—but fails. (Of course, it ought logically to support it for tables, too, but in that case it doesn’t even try. I mean, there’s no such thing as a “null table” in SQL.)

4.16 No. Here are the truth tables:

NOT│ IS NOT TRUE│

───┼─── ───────────┼───

T │ F T │ F

U │ U U │ T

F │ T F │ T

4.17 No. For definiteness, consider the case in which *x* is an SQL row. Suppose (as in the answer to Exercise 4.15 above) that *x* has just two components, *x1* and *x2*. Then *x* IS NOT NULL is defined to be equivalent to *x1* IS NOT NULL AND *x2* IS NOT NULL, and NOT (*x* IS NULL) is defined to be equivalent to *x1* IS NOT NULL OR *x2* IS NOT NULL. What do you conclude from *this* state of affairs?

4.18 The transformation isn’t valid, as you can see by considering what happens if EMP.DNO is null (were you surprised?). The implications, once again, are that users and the system are both likely to make mistakes (and again there’s some history here).

4.19 The query means “Get suppliers who are known not to supply part P2” (note that phrase “known not,” and note also the subtle difference between that phrase and “not known”); it does *not* mean “Get suppliers who don’t supply part P2.” The two formulations aren’t equivalent (consider, e.g., the case where the only SP row for part number P2 in table SP has a null supplier number).

4.20 No two of a., b., c. are equivalent. Statement a. follows the rules of SQL’s 3VL; statement b. follows the definition of SQL’s UNIQUE operator; and statement c. follows SQL’s definition of duplicates. In particular, if *k1* and *k2* are both null, then a. gives UNKNOWN, b. gives FALSE, and c. gives TRUE (!). Here for the record are the rules in question:

In SQL’s 3VL, the comparison

*k1*=*k2*gives TRUE if*k1*and*k2*are both nonnull and are equal, FALSE if*k1*and*k2*are both nonnull and are unequal, and UNKNOWN otherwise.With SQL’s UNIQUE operator, the comparison

*k1*=*k2*gives TRUE if and only if*k1*and*k2*are both nonnull and are equal, and FALSE otherwise (see Chapter 11 for further explanation).In SQL,

*k1*and*k2*are duplicates if and only if either (a) they’re both nonnull and equal or (b) they’re both null.

*Note:* Throughout the foregoing, “equal” refers to SQL’s own, somewhat idiosyncratic definition of the “=” operator (see Chapter 2). *Subsidiary exercise:* Do you think these rules are reasonable? Justify your answer.

4.21 The output from INTERSECT ALL and EXCEPT ALL can indeed contain duplicates, but only if duplicates are present in the input; unlike UNION ALL, therefore, these two operators never “generate” duplicates.

4.22 Yes! (We don’t want duplicates in the database, but that doesn’t mean we never want duplicates anywhere else. As I said in the body of the chapter, there’s a logical difference between logic and rhetoric.) By the way, here’s another nice illustration of essentially the same point that I came across only recently: *Good food is not cheap. Cheap food is not good*.

4.23 A very good question.

4.24 Well, I don’t know about you, but I have quite a few comments myself!

First of all, the phrase “the null value” would be better reduced to just “null” throughout.

Second, observe that (as noted in Chapter 4) although SQL supports three-valued logic, its BOOLEAN data type has just two values, TRUE and FALSE; “the third truth value” is represented not by a value at all but by null. This state of affairs explains (?) the distinction drawn in the second quote between “boolean values” and “SQL truth values” —as far as SQL is concerned, there are three truth values (TRUE, FALSE, and UNKNOWN) but only two boolean values (TRUE and FALSE).

Next: “This [standard] does not make a distinction between the null value of the boolean data type and the truth value Unknown ... [They] may be used interchangeably to mean exactly the same thing.” But, of course, null doesn’t

*always*mean “the third truth value,” so null and “the truth value Unknown” certainly can’t be used “interchangeably” as claimed. In fact, the keyword NULL usually can’t be used in place of the keyword UNKNOWN even when UNKNOWN is the sense intended (see c. and f. in the answer to the last part of the exercise below).The phrase “the null value of the boolean data type” is also rather strange, since there’s just a single null and that null, since it isn’t a value, actually has no type at all.

“Unless prohibited by a NOT NULL constraint, the boolean data type also supports the truth value Unknown ...”: NOT NULL doesn’t apply to data types, it applies to

*uses*of data types (typically as part of a column definition).Formal systems (like SQL) in which the truth values are ordered usually define that ordering to be total. In particular, for three-valued logic, the ordering would typically be such that the comparisons TRUE > UNKNOWN and UNKNOWN > FALSE both return TRUE. SQL, however, defines any comparison involving UNKNOWN (even UNKNOWN = UNKNOWN) to return UNKNOWN.

Following on from the previous point: TRUE > UNKNOWN and UNKNOWN > FALSE (etc.) are apparently legal SQL expressions—but they’re not, according to the standard, legal “boolean value expressions” (despite the fact that they do return a boolean value ... or perhaps I should say, despite the fact that they return “an SQL truth value”).

Finally, the six SQL expressions (or would-be expressions):

Legal; returns TRUE.

Legal; returns null (UNKNOWN).

Illegal.

Legal; returns TRUE.

Legal; returns null (UNKNOWN).

Illegal.

4.25 *No answer provided.*

Get *SQL and Relational Theory, 3rd Edition* now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.