Chapter 4. Probability Essentials

I provide a high-level background on probability theory and its use in Prescriptive Analytics in this chapter. The reason for this chapter is twofold. First, decision-makers deal with uncertainties every day: as I noted in Chapter 1, they make decisions under a cloud of uncertainty. There are no definitive statements, conclusions, or answers to the complex issues they must decide. Each decision involves a simple “Go/No Go” choice such as:

  • Reduce price or not

  • Perform preemptive maintenance on a manufacturing robot today or wait until it fails

  • Introduce a new product or not

  • Remain at the current head count or reduce (increase) it

  • Sell part of the business or not

  • Make the decision today or wait until next year

This is a small listing, not meant to be exhaustive. It is, nonetheless, sufficient to show what decision-makers confront. And for each decision, anything could result. So they try their best to get the most Rich Information to help them make the right choice to maximize their chance for success. In some instances, however, they may be so overwhelmed by the complexity of the decision that they may even feel they would do as well tossing a (fair) coin and letting the “chips fall as they may.” This, of course, immediately introduces probabilities into decision making. So the second reason for this chapter is that uncertainties are expressed in probabilistic terms.

Unfortunately, many decisions are not as simple as my examples. Those are deceiving. Consider the first one: reducing a price. This appears simple enough, but it immediately raises three questions:

  • How much should the price be reduced?

  • When will you reduce the price?

  • What will happen if the price is not reduced?

There may be several candidate price points constituting a decision menu. For each price on the menu, there is a probability of success measured by a KPM such as incremental net revenue earned, the increment being over the trend net revenue without a price change (i.e., status quo). What is that probability? It is important to know this because that probability, actually a probability distribution for each price on the menu, will inform the decision-maker about what should be done, which is what Prescriptive Analytics strives to do.

I will explore in this chapter some of the foundational topics of probability theory useful for Prescriptive Analytics. This is not meant to be a comprehensive treatment, let alone a treatise, on probability theory. For comprehensive treatments, see Keynes (1921), Feller (1950), Feller (1971), and Hajek (2019). These are highly technical. For an application-oriented book, but yet with theoretical developments, see Ross (2014).

The leading questions for this chapter are:

  • What is the role of probabilities in everyday affairs?

  • What is the meaning and interpretation of probabilities?

  • What are the fundamental probability concepts?

  • What are the key probability distributions and their implementation in Python?

The World Is Ruled by Probabilities

There are two ways to view everyday events in the so-called real world: they are either stochastic or non-stochastic. The word stochastic means randomly determined or selected from a probability distribution. You can view “stochastic” as a synonym for “random” although stochasticity and randomness are distinct concepts. I will use them, however, as synonyms. See Wikipedia (2023m) for some comments.

If stochastic means random, then non-stochastic means non-random or deterministic. The extreme implication is that whatever happens is predetermined by a powerful, omnipotent force over which we have no control. This brings up the entire topic of Free Will. There is, of course, a very deep and extensive philosophical literature on this, a literature too deep for our discussion and beyond the scope of this book. Nonetheless, the idea that events are either random or deterministic has profound importance for Prescriptive Analytics. In fact, it has profound importance for Predictive Analytics, which feeds into Prescriptive Analytics, because each relies on uncertainty about what will happen as a result of a decision. If everything is deterministic (i.e., non-stochastic), then you do not even have to make a decision; the event will happen regardless of what you do. Even your very choice is determined. See Pink (2004).

Aside from philosophical discussions, most people, and this includes business managers with great responsibilities, will (correctly) contend that everything is not determined, but random. They base this on their everyday experience. The weather on a particular day (rain, cloudy, or sunny; mild or blustery; cold or warm); the traffic pattern on the way to work; train or bus delays; an encounter in an elevator with a potential new client or your boss: these all occur with neither rhyme nor reason. Even in our personal lives, chance occurrences change us, sometimes for the good, other times not so good; sometimes in a small way, other times in a big way; sometimes temporarily, other times permanently. In all cases, the events are unforeseen, unexpected, unpredictable—just random.

The outcome of a decision (e.g., go to work, go to dinner, sleep-in late) could be anything. Some events are more likely to occur than others because all the “anythings” are distributed in some fashion. They are drawn from a probability distribution. In a business context, the incremental change in net revenue from a price reduction is a draw from a probability distribution; a merger decision could be jeopardized by an unexpected intervention by the Department of Justice (DOJ) for antitrust violations, the intervention being a draw from a binary (to intervene or not) distribution; a competitor introduces a superior new product one month after you introduce your new product, the degree of superiority and timing being independent draws from two separate distributions.

My position is that all events, more specifically the outcomes of decisions, are all stochastic, the results of draws from probability distributions. But more importantly, these random draws are in every part of our lives and, thus, rule our lives. This means the menu presented to a decision-maker should have a probability assigned to each option, but what actually occurs after the decision is made is anyone’s guess. I will revisit the probabilities for menu options in Chapter 6 when I discuss decision trees and decision analysis.

What Are Probabilities?

An obvious question is: “What is a probability?” This is not easy to answer.

Almost everyone believes they have an intuitive, almost sixth sense of probabilities, most likely developed in early grade school when they were first tossing a coin or a die.1

Consider a coin toss. Most school-age children understand there is a 50% chance of getting a heads-up on a toss. This may happen by tossing a coin as a game in a school playground or by a K-12 math teacher demonstrating percentages in a math lesson. Regardless of where or how this lesson is learned, young children do grasp the concept. After that, they treat it as “intuitively obvious.” Consequently, they believe they know exactly what a probability is.

Unfortunately, defining probabilities has vexed mathematicians and philosophers for two thousand years. Yet this is a concept we must deal with; it cannot be avoided because, as noted by Hajek (2019, p. 2):

Probability is virtually ubiquitous. It plays a role in almost all the sciences. It underpins much of the social sciences…. It finds its way, moreover, into much of philosophy.…Since probability theory is central to decision theory and game theory, it has ramifications for ethics and political philosophy.…Thus, problems in the foundations of probability bear at least indirectly, and sometimes directly, upon central scientific, social scientific, and philosophical concerns. The interpretation of probability is one of the most important foundational problems.

Part of the problem in defining probability is its use as a synonym for chance.2 And, of course, defining “chance” is philosophically challenging. I cannot go into the details of this issue any more than I can delve into the issues surrounding the definition of probability; there is not enough room in this book and it would exceed my scope. For issues on interpreting chance, see Eagle (2021).

The concept of probability becomes confusing when you consider the following two statements:

  • There is a 50% chance of getting a heads-up on a toss of a fair coin.

  • There is a 75% chance of rain tomorrow.

The first is a belief, but one based on the experience of tossing a (fair) coin. Tossing a coin a number of times is an experiment, with each toss a trial in the experiment. These terms are not introduced to someone until they take a basic statistics course or a course in probability theory if they are a math major in college. Regardless of where or when they are introduced, the notion of repeating the toss is treated as “intuitively obvious,” so, once again, they believe they understand the concept of a probability.

If the experiment notion is introduced in a statistics course, most likely the professor will run a number of trials in a simulation of a coin toss and show that the proportion of heads will approach 50%. The proportion will, in fact, converge to 50% as the number of trials becomes very large. I will show such a simulation in “Example 1: Coin Toss”. In that simulation, the conditions for the (fair) coin toss are repeated. The 50% for the coin toss is experimentally determined. More importantly, we make decisions all the time based on these probabilities.
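A simulation along these lines can be sketched as follows; this is a minimal illustration using Python’s random module (the function name and the seed are my own, not the book’s):

```python
# A minimal sketch of the coin-toss simulation described above.
import random

def coin_toss_proportion(n_trials, seed=42):
    """Toss a fair coin n_trials times and return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_trials))
    return heads / n_trials

# The proportion of heads settles near 0.5 as the number of trials grows.
for n in (10, 1_000, 100_000):
    print(n, round(coin_toss_proportion(n), 3))
```

With only 10 tosses the proportion can stray well away from 0.5; with 100,000 it is very close, which is the convergence the professor’s demonstration relies on.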

The second statement is fundamentally different, yet almost everyone will say they understand it. What is different is the probability, 75% in this case: it does not result from an experiment. The conditions tomorrow will be unique to tomorrow. In fact, because of the inherent randomness of events I discussed previously, no one can say (i.e., predict) exactly what those conditions will be. The 75% is a belief about what will happen; it is not experimentally determined. It is a subjectively based view of what may happen. Yet we intuitively accept these probabilities and make decisions based on them all the time (e.g., carry an umbrella tomorrow).

So what is a probability? The result of an experiment conducted under fixed conditions and repeated a large number of times? Or a belief that something will or will not happen? Mathematicians and philosophers are torn on these questions. Consequently, there are two opposing schools: experimental or frequency-based on the one side, and belief or subjectively-based on the other side.

Fundamental Probability Concepts

I illustrated these two views of probabilities in Figure 1-3. I will follow that paradigm in this section.

Frequency-Based Probabilities

As I have stated, most people’s first, and perhaps only, introduction to probabilities is through either a coin or die toss, perhaps in the lower grades in school. Let me focus on the coin toss and define the problem somewhat formally, but remaining within the bounds of what you know. I will change these bounds later.

Suppose you toss a normal coin with two sides: heads and tails. The objective is to see which side is face-up when it lands on a surface. The coin is “fair” if it has the same chance of coming up heads or tails on a single toss; it is not weighted to either side. If it is weighted to one side, say heads, then heads will appear more frequently than tails; it is biased to heads, which is unfair. Anyone betting on heads with such a coin will win more times than not; that person will have an unfair advantage.

We define an event as something that happens as a result of the coin toss. It could be a heads-up, denoted by H, or a tails-up, denoted by T. Let me denote by E the set containing the event. This is called the event space. For my example, the event is a heads-up when the coin lands on a surface, so E = {H} where the curly brackets indicate a set. Read this as “The event space is the set containing a head.” The complement to this is the set containing the other result. Denote the complement as E^C, so E^C = {T}.

There is only one result in set E, so the size of the set is simply one. It will be useful to have a function that counts the number of objects in the event set even though there is only one object in it at the moment. Let me denote this counting function as n(E). For my example, n(E) = 1.

When you toss a fair coin, there are two possible results, only one of which will happen: getting heads-up or tails-up. We call the set of all possible results the sample space and denote it as S. This is the sample space because an event is drawn, or sampled, from it. Think of S as a population and E as the sample. For the coin toss, S = {H, T}. The counting function also counts the size of this set, so n(S) = 2.

The elementary definition of probabilities is in terms of relative frequencies:

Pr(E) = n(E) / n(S).

So for the coin toss with E = {H} and S = {H, T},

Pr(E) = Pr(H) = n(E) / n(S) = 1/2.

Other probability problems can now be handled by this simple counting. For example, what is the probability of drawing a King from a normal deck of 52 shuffled playing cards? The event is drawing a King. Since there are four Kings in a normal deck, n(K) = 4 where K is a King. For the sample space, since this is a normal deck, n(S) = 52. Therefore,

Pr(K) = n(K) / n(S) = 4/52 = 1/13.
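This counting-function calculation can be sketched directly in Python; the deck representation below is just one convenient choice:

```python
# A sketch of the relative-frequency definition: build the deck,
# apply the counting function n(.), and divide.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = [(rank, suit) for suit in suits for rank in ranks]  # the sample space S

n_E = sum(1 for rank, suit in deck if rank == "K")  # n(E): count the Kings
n_S = len(deck)                                     # n(S): size of the sample space
print(n_E, n_S, n_E / n_S)  # 4 52 0.07692307692307693
```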

I can now derive an important result. Clearly, the sample space is composed of the event and its complement, so I can write S = E ∪ E^C. This allows me to further write

n(S) = n(E) + n(E^C).

Now divide both sides by n(S) so

1.0 = n(E)/n(S) + n(E^C)/n(S) = Pr(E) + Pr(E^C).

This is a rule for probabilities: they must sum to 1.0 for a sample space. This always holds and is a fundamental property of probabilities. You can check this with the coin toss example. You have Pr(H) = 1/2 from before, but also Pr(T) = 1/2, where Pr(T) is the complement to Pr(H). They sum to 1.0. This is the first rule of probabilities: Pr(E1) + Pr(E2) + ... + Pr(En) = 1.0 over all events Ei in the sample space.

This all depends on the counting function, n(·), which is a simple count for this problem. For many problems, a more complex counting method is needed, which is what I will discuss in the next subsection.

Counting Functions

Counting functions are in the class of functions and methods called combinatorics. This mathematical area deals with ways to arrange objects, but, more importantly, how to count the number of arrangements.

To understand this, let me first focus on counting, then on arrangements. For counting, I could (inefficiently) simply count as in 1, 2, 3, and so on. This quickly becomes cumbersome for complex problems. For example, consider a trip from Princeton, NJ to the Museum of Natural History in Upper Manhattan, New York City. This involves two legs: going from Princeton to Lower Manhattan and then from Lower Manhattan to Upper Manhattan for the Museum. Suppose there are three ways to travel from Princeton to Lower Manhattan: bus, car, light rail train. Once in Lower Manhattan, there are four ways to travel to Upper Manhattan to the Museum: cross-town bus, subway, taxi, or walk.3 I show these options in Figure 4-1.

Figure 4-1 is a decision tree, which I will return to in Chapter 6. Trees are fundamental to decision analysis, another important topic in OR. For a menu selection problem, each branch of the tree is a menu option. A probability can be assigned to the branches as I noted previously. I will return to these probabilities in Chapter 6, but ignore them now.

There are three ways into Manhattan, and for each of these there are four ways to get to the Museum. The total number of ways is 12, which equals 3 × 4 . This is not a coincidence. It is an application of the Fundamental Principle of Counting.

hopa 0401
Figure 4-1. A decision tree showing the many ways to travel to a New York City museum
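The 3 × 4 route count can be verified by enumerating the trips with itertools; the travel-mode labels below follow the text:

```python
# The Fundamental Principle of Counting for the Princeton-to-Museum trip:
# itertools.product pairs every first-leg mode with every second-leg mode.
from itertools import product

leg_1 = ["bus", "car", "light rail"]
leg_2 = ["cross-town bus", "subway", "taxi", "walk"]

routes = list(product(leg_1, leg_2))
print(len(routes))  # 12 = 3 x 4
```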

I need to clarify the word “arrangement.” Formally, a placement is the placing of objects in a line. For example, if you have two objects, a and b, then a possible placement is the two-tuple (a, b). A placement can be in any order; only which objects are placed matters. An arrangement is a placement in which the order of the objects matters for distinguishing one placement from another. For our two objects, the tuples (a, b) and (b, a) are different arrangements even though both contain the same two objects. When the order matters, we say the objects are arranged or permuted, and the result is a permutation of the objects.

If n is the number of distinct objects, then the number of arrangements using all n objects is n × (n − 1) × (n − 2) × ... × 1. This is called a factorial, or n-factorial, and is abbreviated as nPn = n!. n! is sometimes read as “n-factorial” or “n-bang”. I show an example of how to calculate a factorial in Python in Figure 4-2. I use the math package, which has a function called factorial that requires just one argument: the number whose factorial you want.

hopa 0402
Figure 4-2. How to calculate a factorial in Python using the math package’s function factorial. Notice that n is a positional argument in the function.
Note

Some people define 0! = 1 to avoid dividing by zero. This is not correct. 0! = 1 because it is the number of ways to arrange nothing. See Conway and Guy (1996, p. 65).
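As a sketch of what Figure 4-2 shows, the math package’s factorial function computes n! directly, including the 0! = 1 convention just discussed:

```python
# factorial takes a single positional argument, the nonnegative integer n.
from math import factorial

print(factorial(5))  # 120 = 5 x 4 x 3 x 2 x 1
print(factorial(0))  # 1: there is exactly one way to arrange nothing
```

A negative or non-integer argument raises a ValueError.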

If you want to arrange only r < n of the objects, then the number of permutations is nPr = n × (n − 1) × (n − 2) × ... × (n − r + 1), read as “the number of permutations of n objects taking r at a time.” This last statement is simplified as:

nPr = n × (n − 1) × (n − 2) × ... × (n − r + 1)
    = [n × (n − 1) × ... × (n − r + 1) × (n − r) × (n − r − 1) × ... × 1] / [(n − r) × (n − r − 1) × ... × 1]
    = n! / (n − r)!.

This is much easier to use.

As an example, how many ways can you permute (i.e., arrange) all the letters of “dog” to create new words, where a “word” does not mean something you would necessarily find in a dictionary? The answer is 3P3 = 3! = 3 × 2 × 1 = 6. So you can create six new “words.” (Try listing them.) For another example, how many ways can you seat 10 people at a conference table with 4 chairs? This is 10P4 = 10!/(10 − 4)! = 5,040. (Obviously, do not try to list them all.) I show you how to calculate these in Figure 4-3. This uses the math package, which has a function called perm that has two parameters: n and r, in that order. The second parameter, r, is optional, but the first, n, is required. If you omit r, you will get n!. If r > n, the function returns 0. If n < 0 or r < 0, you will get an error message.

hopa 0403
Figure 4-3. The math package has a perm function for calculating permutations. Notice that a factorial is just a permutation. Also, I included a comma in the print function’s last term to make the output easier to read.
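As a sketch of what Figure 4-3 shows, math.perm handles both cases from the text, and itertools.permutations can even list the six “dog” words:

```python
from math import perm
from itertools import permutations

print(perm(3))      # 6: with r omitted, perm(n) returns n!
print(perm(10, 4))  # 5040 ways to seat 10 people in 4 chairs

# The six "words" formed from the letters of "dog":
words = ["".join(p) for p in permutations("dog")]
print(words)  # ['dog', 'dgo', 'odg', 'ogd', 'gdo', 'god']
```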

A more complicated counting problem, and one applicable to probability problems, involves the number of unique arrangements. For permutations, order matters for distinguishing one arrangement from another. So (a, b) differs from (b, a). However, for many probability problems, order does not matter. In this case, you want to count the number of arrangements without regard to order. These are called combinations. A combination is a permutation with duplicate arrangements deleted since the duplicates are redundant: they add nothing to a probability. If you calculate the permutations as nPr, you will find that each distinct group of r objects is counted r! times, so the duplicates should be deleted. You do this by dividing the number of permutations by r!. So the number of combinations of n objects taking r at a time is:

nCr = nPr / r! = n! / ((n − r)! × r!).

For notation, I use nCr for combinations to be consistent with the permutation notation. Another acceptable notation is the binomial coefficient, written with n stacked over r in parentheses and read as “n choose r.”

As an example, suppose the Board of Directors (BOD) wants you, as the data scientist, to interview the members of their C-level team to learn how they use data for decisions. There are five Team members: Chief Executive Officer (CEO), Chief Operating Officer (COO), Chief Financial Officer (CFO), Chief Marketing Officer (CMO), and Chief Legal Officer (CLO). Unfortunately, you are restricted to interviewing only two. How many groups (i.e., combinations) of size r = 2 can you create from the n = 5 officers? You can determine the number of pairs using the combination formula: 5C2 = 5!/(3! × 2!) = 10. I show how you implement this in Python in Figure 4-4. I use the math package’s comb function, which takes the same arguments as the perm function in Figure 4-3.

hopa 0404
Figure 4-4. The math package has a comb function for calculating the number of combinations.
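As a sketch of what Figure 4-4 shows, math.comb counts the interview pairs, and itertools.combinations lists them:

```python
from math import comb
from itertools import combinations

officers = ["CEO", "COO", "CFO", "CMO", "CLO"]
print(comb(5, 2))  # 10 possible interview pairs

# List the pairs themselves; order within a pair does not matter,
# so ("CEO", "COO") and ("COO", "CEO") count once.
for pair in combinations(officers, 2):
    print(pair)
```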

Independence and Conditional Probability

My discussion has involved one event. What if there are two or more? How does the frequency-based calculation of probabilities change, if at all? Let me consider two, say A and B , and then I can always generalize to n . What is the relationship between the two?

There are only two possibilities: they are either separate events with no interaction or they have some commonality (i.e., they are not separate) with an interaction. An example of the first is getting a 1-dot face-up and a 5-dot face-up on a single toss of a fair die; this is impossible. The two events are mutually exclusive or disjoint: they both cannot happen at the same time; either one happens or the other happens. This is sometimes referred to as a logical or. An example of the second possibility is drawing a card at random from a normal deck of 52 shuffled playing cards and that card being King of Hearts: it is a King and a Heart at the same time. This is a logical and.

There is nothing in common between two mutually exclusive events. The count of A does not depend on the count of B, and vice versa. You express this as n(A & B) = 0, where the ampersand represents the logical and. If you still want to count both events, you have to consider whether event A occurs or event B occurs. You then simply count the occurrences of each and add them: n(A or B) = n(A) + n(B). Now you determine the probability of A or B occurring as:

Equation 4-1.  
Pr(A or B) = n(A or B)/n(S) = [n(A) + n(B)]/n(S) = n(A)/n(S) + n(B)/n(S) = Pr(A) + Pr(B).

You now have a second rule of probabilities: Pr(E1 or E2) = Pr(E1) + Pr(E2) if E1 and E2 are mutually exclusive.

However, if A and B are not mutually exclusive, so both can occur at the same time, you cannot simply add the counts because there is an overlap. If you did add the counts, you would double count the overlap and, therefore, overstate the true count. To avoid this, you subtract the overlap. The size of the overlap is n(A & B). This means the probability of A or B is now:

Equation 4-2.  
Pr(A or B) = n(A or B)/n(S) = [n(A) + n(B) − n(A & B)]/n(S) = n(A)/n(S) + n(B)/n(S) − n(A & B)/n(S) = Pr(A) + Pr(B) − Pr(A & B).

This is a third rule of probabilities: Pr(E1 or E2) = Pr(E1) + Pr(E2) − Pr(E1 & E2) if E1 and E2 are not mutually exclusive.

Notice that Equation 4-2 is a generalization of Equation 4-1 because if n(A & B) = 0, so there is no overlap, then Pr(A & B) = 0.
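The King of Hearts example from earlier illustrates Equation 4-2; the fractions module keeps the arithmetic exact:

```python
# Pr(King or Heart) for one card drawn from a normal 52-card deck.
from fractions import Fraction

pr_king = Fraction(4, 52)
pr_heart = Fraction(13, 52)
pr_king_and_heart = Fraction(1, 52)  # the King of Hearts is the overlap

# Equation 4-2: subtract the overlap so it is not counted twice.
pr_king_or_heart = pr_king + pr_heart - pr_king_and_heart
print(pr_king_or_heart)  # 4/13
```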

Now consider a more complex problem. Suppose you conducted a survey of n = 300 customers to determine their opinion of a new product concept you want to introduce. You ask them if they like the concept or not. As part of the survey, you also collect data on their gender so you can understand preferences by gender. I show the (fictitious) data in Figure 4-5. The 100 in the Male-Like cell is the count of respondents who were Male and Liked the new product concept. This is an overlap of two events.

Note

I use the enumerate function inside a list comprehension to create my data in Figure 4-5.

hopa 0405
Figure 4-5. The cross-tabulations (or crosstabs) for fictitious survey data for a new product concept preference. I show how the fictitious data were generated along with both the crosstab and its normalized version. The crosstab was created using the pandas crosstab function.

The top crosstab in Figure 4-5 shows the frequency or count of each combination (i.e., pair) of gender and liking. Since there are two gender groups and two liking groups, there are four cells as shown. The sum of the four cells equals the sample size of n = 300 . If you divide the cell values by the sample size, you get estimates of the probabilities for each cell. I also show these estimates in Figure 4-5 as the normalized crosstab. The 0.333 value in the Male-Like cell is the estimate of the probability a person selected at random from the population will be Male and Likes the new product concept; it is the joint probability. Similarly for the other cells. Notice the probabilities sum to 1.0. The row totals are called the row marginals which give the estimated probabilities of a person selected at random being male or female—regardless of liking the concept. Similarly, the column totals are the column marginals and give the estimated probabilities of liking the concept—regardless of their gender.

Note

The row marginal is the sum of the columns for each row and, so, is the last column of the table. Similarly for the column marginal.
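A crosstab along the lines of Figure 4-5 can be sketched as follows. Only the Male row counts (100 Like, 50 Dislike) and the n = 300 total are stated in the text; the Female cell counts below (80 Like, 70 Dislike) are my own assumptions for illustration, and the data are built differently from the book’s enumerate-based list comprehension:

```python
# Sketch of the Figure 4-5 crosstabs. The Female cell counts (80/70)
# are ASSUMED values chosen only so the table sums to n = 300.
import pandas as pd

counts = {"Male": {"Like": 100, "Dislike": 50},
          "Female": {"Like": 80, "Dislike": 70}}

# Expand the cell counts into one row per (fictitious) respondent.
rows = [(gender, opinion)
        for gender, opinions in counts.items()
        for opinion, n in opinions.items()
        for _ in range(n)]
df = pd.DataFrame(rows, columns=["Gender", "Opinion"])

# Frequency crosstab with row and column marginals.
print(pd.crosstab(df["Gender"], df["Opinion"], margins=True))

# Normalized crosstab: each cell is an estimated joint probability.
print(pd.crosstab(df["Gender"], df["Opinion"], normalize="all", margins=True))
```

With these counts, the Male-Like cell of the normalized table is 100/300 ≈ 0.333, matching the text.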

You can now ask an interesting question: “What is the estimated probability of a randomly selected male liking the concept?” Or, to slightly rephrase it: “What is the estimated probability of a randomly selected person liking the concept, given the person is male?” This is a conditional statement and the probability is a conditional probability. Gender is the condition. We express this conditional probability as Pr(A | B), where the vertical line indicates this is a conditional statement and the symbol after the vertical line is the condition. So Pr(Like | Male) is the probability of someone liking the new product concept given the person is male.

This type of question is very common, and also very subtle, in real-world applications. You will see examples later. See Pinker (2021) for an extensive discussion and examples of how people overlook these conditionalities in addressing real-world problems.

To answer this conditional question, you have to recognize you are confined to the first row of Figure 4-5, so the relevant divisor for calculating the estimated probabilities is not the total sample, n = 300, but the smaller one for the first row, the marginal total of n = 150. The estimated probability is then Pr(Like | Male) = 100/150 = 0.667. I show how to calculate these conditional probabilities in Figure 4-6.

hopa 0406
Figure 4-6. How the conditional probabilities can be calculated from the crosstab in Figure 4-5. The condition in this example is gender. The last row labeled “Total” is the column marginal distribution for liking.
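The conditional probabilities in Figure 4-6 follow from normalizing each row by its row marginal, which pandas crosstab does with normalize="index". The Female counts are again my assumed values, not the book’s:

```python
# Conditional (row-normalized) crosstab in the spirit of Figure 4-6.
# Female counts (80 Like, 70 Dislike) are ASSUMED for illustration.
import pandas as pd

df = pd.DataFrame(
    [("Male", "Like")] * 100 + [("Male", "Dislike")] * 50
    + [("Female", "Like")] * 80 + [("Female", "Dislike")] * 70,
    columns=["Gender", "Opinion"],
)

# normalize="index" divides each cell by its row total (the row marginal).
cond = pd.crosstab(df["Gender"], df["Opinion"], normalize="index")
print(cond)
print(round(cond.loc["Male", "Like"], 3))  # 0.667
```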

If you have a sharp eye, you might notice that if you divide the 0.333 in the normalized crosstab in Figure 4-5 by the row marginal probability for Males, which is 0.500, you get 0.667 (with rounding, of course). This is not a coincidence. It will always be the case that

Equation 4-3.  
Pr(A | B) = [n(A & B)/n(S)] / [n(B)/n(S)] = Pr(A & B) / Pr(B).

The probability of A occurring given B occurred equals the probability of A and B jointly occurring divided, or adjusted, by the marginal probability of B occurring alone. The adjustment factor is the marginal probability of the conditioning event, the one to the right of the vertical line in the conditional probability statement on the left-hand side of Equation 4-3. This gives you a fourth rule of probabilities:

Pr(E1 | E2) = Pr(E1 & E2) / Pr(E2)

I can now reach a very important conclusion. Suppose liking or not liking the new product concept does not depend on the consumer’s gender: the opinions are independent of gender. The distribution for males is the same as for females; there is only one distribution for liking, given by the column marginal distribution (i.e., the last row) in Figure 4-6. This implies that I can write:

Pr(Like | Male) = Pr(Male & Like) / Pr(Male)
Pr(Like) = Pr(Male & Like) / Pr(Male)

so,

Pr(Male & Like) = Pr(Like) × Pr(Male).
Warning

Be careful about mutual exclusivity and independence. If E1 and E2 are mutually exclusive events with nonzero probabilities, then Pr(E1 & E2) = 0 and they are not independent. If E1 and E2 are independent events with nonzero probabilities, then Pr(E1 & E2) = Pr(E1) × Pr(E2) > 0 and they are not mutually exclusive.

In general, under independence, Pr(A & B) = Pr(A) × Pr(B): the joint probability can be factored into a product of marginal probabilities. You now have a final rule of probabilities: Pr(E1 & E2) = Pr(E1) × Pr(E2) under independence.
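A quick check of this factoring with two fair dice (my own example, not the book’s): let A be “the first die shows an even face” and B be “the second die shows more than 4 dots.” Enumerating all 36 equally likely outcomes gives exact probabilities:

```python
# Independence check by exhaustive enumeration of the sample space.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # the sample space, n(S) = 36

pr_A = Fraction(sum(1 for d1, d2 in outcomes if d1 % 2 == 0), len(outcomes))
pr_B = Fraction(sum(1 for d1, d2 in outcomes if d2 > 4), len(outcomes))
pr_AB = Fraction(sum(1 for d1, d2 in outcomes if d1 % 2 == 0 and d2 > 4),
                 len(outcomes))

print(pr_A, pr_B, pr_AB)     # 1/2 1/3 1/6
print(pr_AB == pr_A * pr_B)  # True: the joint probability factors
```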

Summary of Probability Rules

The five rules of probabilities are:

Rule 1

Pr(E1) + Pr(E2) + ... + Pr(En) = 1.0 over all events Ei in the sample space.

Rule 2

Pr(E1 or E2) = Pr(E1) + Pr(E2) if E1 and E2 are mutually exclusive.

Rule 3

Pr(E1 or E2) = Pr(E1) + Pr(E2) − Pr(E1 & E2) if E1 and E2 are not mutually exclusive.

Rule 4

Pr(E1 | E2) = Pr(E1 & E2) / Pr(E2).

Rule 5

Pr(E1 & E2) = Pr(E1) × Pr(E2) under independence.

Limit Definition of Probabilities

The probabilities in the previous sections all have a common foundation: they are based on frequencies, counts of an event for a defined problem such as tossing a fair die. Those frequencies are normalized by the frequency of all possible events for that problem. This is the classical definition of probability taught in elementary math and statistics courses. A deeper development defines a probability with respect to the number of trials in an experiment. In particular, as the number of trials becomes infinitely large, the ratio of the frequency of the event to the frequency of the sample space converges to a value. That value is the probability. See Clayton (2021) for a discussion.

The complete mathematical theory of probabilities relies on limit theorems. A probability is formally defined as the limit of the relative frequency as the number of trials goes to infinity. This is, of course, not possible to implement for practical problems, although you can approximate it using simulations. I will discuss simulations in Chapter 7. For now, just be aware of the issue. See Clayton (2021) for a discussion of this limiting issue.

Subjective-Based Probabilities: Introduction

In many situations, we quote probabilities, and have an intuitive sense of their meaning, even though they have no basis in experiments. For these cases, there are no experiments that can be performed, yet we quote these probabilities all the time. A classic example is the probability of rain I noted in “What Are Probabilities?”. You cannot perform an experiment in which you repeat tomorrow exactly as tomorrow will occur—it has not occurred yet, so we cannot know the conditions, let alone repeat them. You cannot repeat tomorrow any more than you can repeat today, yesterday, or any period. You may like to, but you simply cannot. Hence, there are no frequencies available; tomorrow is a unique event, and you do not know what it will be.

The only way to specify these probabilities is by using your knowledge base to specify a most likely “guess” regarding the probability of an outcome. That knowledge base is actually a part of the frequency-based probabilities because you usually state something about the conditions for the experiment. For example, the coin is “fair.” This means the head is as likely as the tail to occur on a toss. This is part of what you know, part of your knowledge base. This is a problem because this likelihood is itself a probability (this is an interpretation of the word likelihood) so you use a probability to calculate a probability—a circular argument. This is, however, usually overlooked. See Clayton (2021) for a discussion of this finer point about frequency-based probabilities.

A non-frequency-based probability, based on your knowledge base, is subjective. What is that knowledge and where does it come from? First, your knowledge is whatever is relevant to a problem. For the probability of rain tomorrow, knowledge of current weather patterns and general meteorological theory are certainly necessary. Knowledge of stock market behavior is obviously useless (to most people). However, to state the probability the stock market will rise tomorrow requires knowledge about the economy, economic theory, financial theory, and past stock market behavior; current weather patterns and meteorological theory are not useful (most times; there are exceptions). The sources of that knowledge are wide and varied. A good, worthwhile descriptor is experience. You know things. And it is your knowledge from experience that allows you to formulate and state these subjective probabilities.

Obviously, your knowledge base must be in place in your mind before, or prior to, your formulating the probabilities since this knowledge is, after all, a base. A probability formed from this prior knowledge is a prior probability (or prior for short).

Your knowledge base, however, constantly changes as you learn more, as you gain more experience. So you need a mechanism to update your prior, that is, to incorporate the new knowledge into your probability formulation and revise your prior. The revised prior is a posterior probability (or posterior for short). This can become confusing because that posterior will become a new prior to be further updated when newer information becomes available. Thus, a cycle is established, as I illustrate in Figure 4-7.

Figure 4-7. The circular flow of probability updating as new information becomes available. The posterior is the updated prior which in turn becomes a new prior. This circular flow continues ad infinitum. This is an example of an infinite loop.

Our formulation of subjective probabilities is based on two factors:

  1. A prior probability representing our relevant knowledge before the event occurs

  2. A mechanism for updating the prior as new knowledge becomes available

The mechanism is the important theorem called Bayes’ Theorem. There is a simple derivation of the main result of this theorem, but its use and significance are anything but simple. Its depth goes beyond just calculating probabilities based on knowledge other than frequencies from experiments. It extends to whole research programs and applications in statistics, econometrics, machine learning, and even marketing, psychology, and philosophy. For statistics, see Kenett et al. (2012), Gelman (2006), and Gelman et al. (2013); for econometric applications, see Zellner (1971); for marketing, see Rossi et al. (2005); for psychology, see Pinker (2021); and for philosophical applications and implications, see Lin (2023). I will discuss Bayes’ Theorem and its application to decision making in the context of a decision menu in Chapter 10.

Bayes’ Theorem: Derivation

The derivation of Bayes’ Theorem relies on the posterior as a conditional statement: the probability conditioned on new information. So you can use the conditional probability statements from “Independence and Conditional Probability”. First, define A as the event you are interested in and I as information. Then, from Equation 4-3 you have

Equation 4-4.  
Pr(A | I) = Pr(A & I) / Pr(I)

where Pr(I) is the probability of the new information. You can reverse this as

Equation 4-5.  
Pr(I | A) = Pr(A & I) / Pr(A).

Now solve for Pr(A & I) in Equation 4-5 and substitute the result into Equation 4-4 to get

Equation 4-6.  
Pr(A | I) = Pr(I | A) × Pr(A) / Pr(I).

This is Bayes’ Theorem. It has a long history. See Clayton (2021) for an interesting historical account as well as Pinker (2021) for its use in rational decision making. Finally, see Paczkowski (2022b, Chapter 8) for a derivation similar to what I show here. A more technical derivation is in Lin (2023, Section 2).

The right-hand side of the Bayes’ Theorem equation has three parts. The Pr(A) in the numerator is the prior: the probability of A without regard to any knowledge or information, that is, data, beyond what you already know or have. The second term in the numerator, Pr(I | A), is the probability of you realizing the information or data about an event given that the event occurred. This is called the likelihood. The term in the denominator, Pr(I), is the marginal probability of seeing the information you have over all possible states of the event A. It is a scaling factor to ensure the probabilities sum to 1.0. As noted in Paczkowski (2022b, Chapter 8), this marginal probability is Pr(I) = Pr(I & A) + Pr(I & ¬A), where the symbol ¬ means “not” or “the complement of.” Finally, the term on the left-hand side is the posterior probability.

The entire expression in Equation 4-6 says the prior, initially founded on your existing knowledge base, is updated or revised by the likelihood of seeing your new information or data, the adjustment being normalized by the probability of seeing that information. The posterior becomes the new prior as I illustrated in Figure 4-7. Bayes’ Theorem has become a major force in statistical analysis and is at the heart of the analytical split I illustrated in Figure 1-3.
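The prior-to-posterior cycle of Figure 4-7 can be sketched in a few lines of Python. The function name, the starting prior of 0.5, and the two likelihood values are hypothetical choices for illustration:

```python
def bayes_update(prior, like_a, like_not_a):
    """One pass of Bayes' Theorem for a two-state event A.

    prior:      Pr(A) before seeing the information
    like_a:     Pr(I | A), the likelihood
    like_not_a: Pr(I | not A)
    Returns the posterior Pr(A | I).
    """
    marginal = prior * like_a + (1 - prior) * like_not_a  # Pr(I)
    return prior * like_a / marginal

# The posterior from each round becomes the prior for the next round
prob = 0.5  # hypothetical starting prior
for _ in range(3):
    prob = bayes_update(prob, like_a=0.7, like_not_a=0.4)
```

Because the assumed likelihood favors the event, each update pushes the probability above its starting value of 0.5, illustrating how repeated information revises a prior.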

Bayes’ Theorem: Python Implementation

To illustrate the application of Equation 4-6 in Python, reconsider the new product problem that led to the crosstab in Figure 4-5. Suppose a male customer is randomly selected, perhaps from a database of existing customers. What is the probability the male customer will like the product concept? The “liking” is the information from the survey. Further, just for this example, suppose 65% of surveyed customers like the concept and 35% do not. Also, suppose of those who like it, 51.3% are male, and of those who do not like it, 47.6% are male. I summarize these numbers in Table 4-1.

Table 4-1. Bayes’ Example

Pr(Like) = 0.65        Pr(Male | Like) = 0.513
Pr(Not Like) = 0.35    Pr(Male | Not Like) = 0.476

Using the equation for the conditional probability, then:

Pr(Like | Male) = Pr(Like) × Pr(Male | Like) / Pr(Male) = (0.65 × 0.513) / 0.50 = 0.667.

I calculated the probability of a Male as 0.65 × 0.513 + 0.35 × 0.476 = 0.50. I implement these calculations in Python in Figure 4-8.

Figure 4-8. The implementation of Bayes’ Theorem in Python.

I define the two sets of probabilities in Lines 4 and 5. The marginal probability is created in Line 9 using NumPy’s inner function, which calculates an inner or dot product of two vectors. The posterior probability is calculated in Line 13 using the first value of the prior and the likelihood.

Note

This example used the inner product (also called the dot product) of two lists. The inner product is defined as A · B = Σ_{i=1}^{n} a_i × b_i, where A and B are lists (i.e., vectors) of the same length n. The NumPy function inner calculates the inner product.
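A sketch of the calculation in Figure 4-8, using the probabilities from Table 4-1 (the variable names are mine and the figure's exact code may differ), is:

```python
import numpy as np

prior = np.array([0.65, 0.35])         # Pr(Like), Pr(Not Like)
likelihood = np.array([0.513, 0.476])  # Pr(Male | Like), Pr(Male | Not Like)

# Marginal probability of selecting a male: the inner (dot) product
pr_male = np.inner(prior, likelihood)

# Posterior via Bayes' Theorem: Pr(Like | Male)
posterior = prior[0] * likelihood[0] / pr_male

# Pr(Male) ≈ 0.500, Pr(Like | Male) ≈ 0.667
print(f"Pr(Male) = {pr_male:.3f}, Pr(Like | Male) = {posterior:.3f}")
```

The inner product collapses the prior-times-likelihood terms into the scaling factor Pr(I) in one call, which is why Figure 4-8 uses it for the marginal probability.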

Probability Distributions: Overview

I introduced the concept of a probability distribution in the previous section. A distribution summarizes the entire array of probabilities for a class of events. For example, suppose you toss a fair coin three times. The array of possible events is: no heads, one head, two heads, and three heads. As you will see shortly, the corresponding probabilities are [0.125, 0.375, 0.375, 0.125], respectively, which sum to 1.0. This list is a distribution. I show what this looks like in Figure 4-9.

Figure 4-9. The probability distribution for tossing a fair coin three times. A dictionary defines heads and their corresponding probabilities. I use a list comprehension for the heads labels and the heads list.
Note

Figure 4-9 uses the pandas plot method with the “bar” plot kind. The plot is saved in a variable named ax, which I can then use to annotate the graph. In this case, I added descriptive labels to the X-axis.

Three Basic Probability Distributions: Binomial, Uniform, and Normal

There are three probability distributions that appear often in applied analytics, whether Predictive or Prescriptive: binomial, uniform, and normal.

Binomial distribution

The binomial distribution is based on repeated trials of an experiment where the result of a trial is one of two possible outcomes, usually called “success” or “failure.” These are unfortunate labels, but, nonetheless, conventional ones. There is really no reason why you have to use them. There are n trials in the experiment, and each is independent of any other. The probability of a success is assumed to be a constant, p, so the probability of a failure is a constant, q = 1 − p, for a single trial. This strong assumption states that the likelihood is the same regardless of the trial. The distribution is sometimes written as X ∼ Bin(n, p).

Under these assumptions, the probability the random variable X will have x successes in n trials is given by

Pr(X = x) = n! / (x! × (n − x)!) × p^x × q^(n − x) = nCx × p^x × q^(n − x).

As an example, suppose the experiment is tossing a fair coin n = 3 times. The probability of a heads-up on a single toss is a constant p = 0.5. You want to know the probability of a single heads-up in the three tosses: Pr(X = 1), where X is the number of heads-up. The probability using the previous equation is Pr(X = 1) = 3C1 × 0.5^1 × 0.5^2 = 0.375. The entire distribution is

3C0 × 0.5^0 × 0.5^3 = 0.125
3C1 × 0.5^1 × 0.5^2 = 0.375
3C2 × 0.5^2 × 0.5^1 = 0.375
3C3 × 0.5^3 × 0.5^0 = 0.125.

This distribution is the binomial probability mass function (pmf). I show a graph of this distribution in Figure 4-9. I use a list comprehension in Line 6 to create a list of heads: [ 0 , 1 , 2 , 3 ] for n = 3 tosses. Another list comprehension in Line 7 creates the labels. These two lists are put into a dictionary in Line 8 which I use to create a DataFrame in Line 9. The binomial package’s probability mass function (pmf) is used in the dictionary for the probabilities. I then plotted the DataFrame using the “bar” option.
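A sketch of that kind of code follows. To keep it self-contained, I compute the pmf directly from the formula with math.comb rather than a binomial package, so the exact imports and labels in Figure 4-9 may differ:

```python
import math

import pandas as pd

n, p = 3, 0.5
heads = [x for x in range(n + 1)]       # [0, 1, 2, 3]
labels = [f"{x} heads" for x in heads]  # X-axis labels

# Binomial pmf computed directly: nCx * p^x * q^(n-x)
pmf = [math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in heads]

df = pd.DataFrame({"heads": labels, "probability": pmf})
ax = df.plot(kind="bar", x="heads", y="probability", legend=False, rot=0)
```

The pmf list is [0.125, 0.375, 0.375, 0.125], matching the distribution derived above, and the bar plot reproduces the shape of Figure 4-9.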

Uniform distribution

The uniform distribution is sometimes the first that statistics students are introduced to because of its simplicity and commonsense appeal. Basically, it provides the probability for a range of values in the closed interval [a, b]. Usually the interval is [0, 1] and we write U ∼ Unif(0, 1). This could be rescaled to cover any general interval [min, max] using

U_i^New = (U_i − U_Min) / (U_Max − U_Min) × (U_Max^New − U_Min^New) + U_Min^New.

This is a linear transformation so that the distributional properties of U are preserved. For example, if U ∼ Unif(0, 1), then X ∼ Unif(3, 6) using U_i^New = 3 × U_i + 3 = 6 × U_i + 3 × (1 − U_i). See Paczkowski (2022a, Chapter 5) for a discussion of linear transformations.
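A quick sketch of this rescaling with Python's random module; the target interval [3, 6] follows the text's example, and the seed and sample size are my choices:

```python
import random

random.seed(7)
u = [random.random() for _ in range(1000)]  # draws from Unif(0, 1)

# Linear rescaling from [0, 1] to [new_min, new_max]
new_min, new_max = 3, 6
x = [ui * (new_max - new_min) + new_min for ui in u]  # i.e., 3 * ui + 3
```

Every rescaled value lies in [3, 6], and the sample mean is close to the midpoint 4.5, as expected for a uniform distribution on that interval.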

The height or density of the curve for a uniformly distributed random variable, U, in the general interval [a, b], is 1 / (b − a). I show an example in Figure 4-10 for the interval [0, 1].

Figure 4-10. The pdf for U ∼ Unif(0, 1).

Normal distribution

There are many distributions for a wide range of applications. The most used is the normal distribution, commonly called the “bell-shaped distribution.” You should avoid this label, however, because many other distributions are also bell-shaped; the Student t-distribution is one example. It is simply the normal distribution.

I illustrate a typical normal distribution in Figure 4-11. A normal distribution is defined by two parameters: the mean and the variance. For the one I show in Figure 4-11, the mean is zero and the variance is 1.0. With these parameter settings, the proper name is the standardized normal distribution. For notational purposes, the normal is written as 𝒩(μ, σ²), where μ is the mean and σ² is the variance. For the distribution in Figure 4-11, the notation is 𝒩(0, 1). You will see this notation, and many others, throughout this book.

Figure 4-11. The normal distribution with mean of zero and variance 1.0.
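A brief sketch of drawing from 𝒩(0, 1) with NumPy's random generator; the seed and sample size are my choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.normal(loc=0.0, scale=1.0, size=100_000)  # standardized normal

# The sample mean and standard deviation should be close to 0 and 1
print(draws.mean(), draws.std())
```

With a sample this large, the estimates land within a few hundredths of the true parameters, a useful sanity check when simulating from any distribution.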

Key Distribution Parameters

Distributions are commonly characterized by two parameters: the mean and the variance.

The mean of a random variable X is called the expected value, represented by E(X). Its definition depends on the type of random variable: discrete or continuous. If the random variable, X, is discrete, then

E(X) = Σ_i x_i × p(x_i)

where p(x) = Pr(X = x), the probability that X = x, is the pmf with 0 ≤ p(x) ≤ 1 and Σ p(x) = 1. So E(X) is just a weighted average.

If X is continuous, then

E(X) = ∫_{−∞}^{+∞} x × f(x) dx

with f(x) ≥ 0 for all x, and ∫_{−∞}^{+∞} f(x) dx = 1. The function f(x) is the probability density function (pdf) of X at x. The pdf is NOT the probability of seeing X = x.

The variance is comparably defined except that squared deviations from the mean are used:

V(X) = Σ_i [x_i − E(X)]² × p(x_i)

and

V(X) = ∫_{−∞}^{+∞} [x − E(X)]² × f(x) dx.

I show a function in Figure 4-12 for calculating the expected value for a discrete random variable. I then show an application of this function in Figure 4-13 using a small list of values and associated probabilities.

Figure 4-12. A function to calculate the expected value and variance for a discrete random variable. Notice that I used a list comprehension and the enumerate function to do the calculations.
Figure 4-13. How to calculate the expected value and variance for a discrete random variable. I use the function in Figure 4-12.
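A sketch of such a function follows; the function name and the example values are mine and need not match those in Figures 4-12 and 4-13:

```python
def discrete_moments(values, probs):
    """Expected value and variance of a discrete random variable.

    values: the outcomes x_i
    probs:  the probabilities p(x_i); they should sum to 1
    """
    ev = sum(x * p for x, p in zip(values, probs))               # E(X)
    var = sum((x - ev) ** 2 * p for x, p in zip(values, probs))  # V(X)
    return ev, var

# Example: number of heads in three tosses of a fair coin
ev, var = discrete_moments([0, 1, 2, 3], [0.125, 0.375, 0.375, 0.125])
print(ev, var)  # prints 1.5 0.75
```

The variance of 0.75 agrees with the binomial formula n × p × q = 3 × 0.5 × 0.5, a quick cross-check on the function.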

Summary

This is the second of three primers for this book. This one covers probabilities, the backbone of all data science efforts to extract Rich Information from data. The reason for probabilities is simple: all data reflect random variations due to unknown and unknowable causes. This is the Noise in the paradigm: Data = Information + Noise.

In this chapter, I reviewed basic probability concepts summarized as fundamental probability laws. I then showed how you can do some probability calculations, derived Bayes’ Theorem, discussed prior and posterior probabilities, and showed you how to access probability distributions using Python’s random and NumPy packages.

The main concepts from this chapter are:

  • The intuitive and formal probability concept

  • Five Probability Rules

  • Bayes’ Theorem: its derivation, interpretation, and importance in data science and Prescriptive Analytics

  • Prior and posterior probabilities

  • Probability distributions using Python

1 The word “dice” is plural and the word “die” is singular for a cube with dots, called “pips,” on each side. The cube is used in games.

2 When I ask my students what is a probability, they say it is a chance. When I ask them what is chance, they say it is a probability. A little circular and not helpful.

3 I assume if I drive into New York City, I will leave my car at a City parking lot because driving in New York is hazardous to my health. So driving my car is not an option for me. The light rail will leave me at Penn Station and the bus will leave me at the Port Authority Bus Terminal, both within blocks of each other.
