Chapter 1. Probability
The foundation of Bayesian statistics is Bayes’s theorem, and the foundation of Bayes’s theorem is conditional probability.
In this chapter, we’ll start with conditional probability, derive Bayes’s theorem, and demonstrate it using a real dataset. In the next chapter, we’ll use Bayes’s theorem to solve problems related to conditional probability. In the chapters that follow, we’ll make the transition from Bayes’s theorem to Bayesian statistics, and I’ll explain the difference.
Linda the Banker
To introduce conditional probability, I’ll use an example from a famous experiment by Tversky and Kahneman, who posed the following question:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
Linda is a bank teller.
Linda is a bank teller and is active in the feminist movement.
Many people choose the second answer, presumably because it seems more consistent with the description. It seems uncharacteristic if Linda is just a bank teller; it seems more consistent if she is also a feminist.
But the second answer cannot be “more probable”, as the question asks. Suppose we find 1,000 people who fit Linda’s description and 10 of them work as bank tellers. How many of them are also feminists? At most, all 10 of them are; in that case, the two options are equally probable. If fewer than 10 are, the second option is less probable. But there is no way the second option can be more probable.
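The counting argument can be sketched in a few lines of Python. The numbers are the hypothetical ones from the thought experiment (1,000 people, 10 bank tellers), the variable names are illustrative, and the list simply tries every possible number of feminist bank tellers.

```python
# Hypothetical counts from the thought experiment in the text.
n_people = 1000
n_tellers = 10

p_teller = n_tellers / n_people

# The number of feminist bank tellers is unknown, but it can only
# range from 0 to n_tellers, so compute the fraction for every case.
possible = [k / n_people for k in range(n_tellers + 1)]

# The largest possible conjunction equals p_teller and never exceeds it.
print(p_teller)                   # 0.01
print(max(possible) <= p_teller)  # True
```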
If you were inclined to choose the second option, you are in good company. The biologist Stephen J. Gould wrote:
I am particularly fond of this example because I know that the [second] statement is least probable, yet a little homunculus in my head continues to jump up and down, shouting at me, “but she can’t just be a bank teller; read the description.”
If the little person in your head is still unhappy, maybe this chapter will help.
Probability
At this point I should provide a definition of “probability”, but that turns out to be surprisingly difficult. To avoid getting stuck before we start, we will use a simple definition for now and refine it later: A probability is a fraction of a finite set.
For example, if we survey 1,000 people, and 20 of them are bank tellers, the fraction that work as bank tellers is 0.02 or 2%. If we choose a person from this population at random, the probability that they are a bank teller is 2%. By “at random” I mean that every person in the dataset has the same chance of being chosen.
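As a rough illustration of this definition, the sketch below builds the hypothetical survey of 1,000 people as a list and computes the fraction by counting. The repeated draws, the seed, and the variable names are assumptions added for illustration; they only show that choosing "at random" yields a frequency close to the fraction.

```python
import random

# Hypothetical survey: 1,000 people, 20 of whom are bank tellers.
population = [True] * 20 + [False] * 980

# Probability as a fraction of a finite set.
p_teller = sum(population) / len(population)
print(p_teller)  # 0.02

# "At random" means every person is equally likely to be chosen.
# Over many draws, the observed frequency should be close to p_teller.
rng = random.Random(17)  # fixed seed so the run is repeatable
draws = [rng.choice(population) for _ in range(100_000)]
frequency = sum(draws) / len(draws)
print(frequency)
```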
With this definition and an appropriate dataset, we can compute probabilities by counting. To demonstrate, I’ll use data from the General Social Survey (GSS).
I’ll use pandas to read the data and store it in a DataFrame.
    import pandas as pd

    gss = pd.read_csv('gss_bayes.csv', index_col=0)
    gss.head()
caseid | year | age | sex | polviews | partyid | indus10 |
---|---|---|---|---|---|---|
1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |
The DataFrame has one row for each person surveyed and one column for each variable I selected.
The columns are:
- caseid: Respondent id (which is the index of the table).
- year: Year when the respondent was surveyed.
- age: Respondent’s age when surveyed.
- sex: Male or female.
- polviews: Political views on a range from liberal to conservative.
- partyid: Political party affiliation: Democratic, Republican, or independent.
- indus10: Code for the industry the respondent works in.
Let’s look at these variables in more detail, starting with indus10.
Fraction of Bankers
The code for “Banking and related activities” is 6870, so we can select bankers like this:
    banker = (gss['indus10'] == 6870)
    banker.head()

    caseid
    1    False
    2    False
    5    False
    6     True
    7    False
    Name: indus10, dtype: bool
The result is a pandas Series that contains the Boolean values True and False.
If we use the sum function on this Series, it treats True as 1 and False as 0, so the total is the number of bankers:
    banker.sum()

    728
In this dataset, there are 728 bankers.
To compute the fraction of bankers, we can use the mean function, which computes the fraction of True values in the Series:
    banker.mean()

    0.014769730168391155
About 1.5% of the respondents work in banking, so if we choose a random person from the dataset, the probability they are a banker is about 1.5%.
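To see why sum and mean give a count and a fraction, here is a minimal sketch on a tiny made-up Series. The six industry codes are invented for illustration and are not GSS data; only the comparison with 6870 follows the text.

```python
import pandas as pd

# Hypothetical industry codes for six respondents (not GSS data).
indus10 = pd.Series([4970, 6870, 2670, 6870, 7860, 9160])

is_banker = (indus10 == 6870)

count = is_banker.sum()      # True counts as 1, False as 0
fraction = is_banker.mean()  # the count divided by the number of rows

print(count)     # 2
print(fraction)  # 2 of 6 rows, about 0.33
```

Because the mean of a Boolean Series is the count of True values divided by the number of rows, it matches the definition of probability as a fraction of a finite set.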
The Probability Function
I’ll put the code from the previous section in a function that takes a Boolean Series and returns a probability:
    def prob(A):
        """Computes the probability of a proposition, A."""
        return A.mean()
So we can compute the fraction of bankers like this:
    prob(banker)

    0.014769730168391155
Now let’s look at another variable in this dataset. The values of the column sex are encoded like this:
    1    Male
    2    Female
So we can make a Boolean Series that is True for female respondents and False otherwise:
    female = (gss['sex'] == 2)
And use it to compute the fraction of respondents who are women:
    prob(female)

    0.5378575776019476
The fraction of women in this dataset is higher than in the adult US population because the GSS doesn’t include people living in institutions like prisons and military housing, and those populations are more likely to be male.
Political Views and Parties
The other variables we’ll consider are polviews, which describes the political views of the respondents, and partyid, which describes their affiliation with a political party.
The values of polviews are on a seven-point scale:
    1    Extremely liberal
    2    Liberal
    3    Slightly liberal
    4    Moderate
    5    Slightly conservative
    6    Conservative
    7    Extremely conservative
I’ll define liberal to be True for anyone whose response is “Extremely liberal”, “Liberal”, or “Slightly liberal”:
    liberal = (gss['polviews'] <= 3)
Here’s the fraction of respondents who are liberal by this definition:
    prob(liberal)

    0.27374721038750255
If we choose a random person in this dataset, the probability they are liberal is about 27%.
The values of partyid are encoded like this:
    0    Strong democrat
    1    Not strong democrat
    2    Independent, near democrat
    3    Independent
    4    Independent, near republican
    5    Not strong republican
    6    Strong republican
    7    Other party
I’ll define democrat to include respondents who chose “Strong democrat” or “Not strong democrat”:
    democrat = (gss['partyid'] <= 1)
And here’s the fraction of respondents who are Democrats, by this definition:
    prob(democrat)

    0.3662609048488537
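If you want to see the whole distribution of codes rather than one threshold at a time, pandas can tabulate the fraction for each value in one step. The following is a small sketch on made-up partyid codes, not the GSS data; it relies only on value_counts with normalize=True, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical partyid codes for eight respondents (not GSS data).
partyid = pd.Series([0, 1, 1, 3, 6, 0, 2, 5])

# Fraction of respondents with each code, ordered by code.
fractions = partyid.value_counts(normalize=True).sort_index()

# Adding the fractions for codes 0 and 1 gives the same answer as
# taking the mean of the Boolean Series (partyid <= 1).
p_from_table = fractions.loc[[0, 1]].sum()
p_from_mean = (partyid <= 1).mean()

print(p_from_table, p_from_mean)  # 0.5 0.5
```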
Conjunction
Now that we have a definition of probability and a function that computes it, let’s move on to conjunction.
“Conjunction” is another name for the logical and operation. If you have two propositions, A and B, the conjunction A and B is True if both A and B are True, and False otherwise.
If we have two Boolean Series, we can use the & operator to compute their conjunction. For example, we have already computed the probability that a respondent is a banker:
    prob(banker)

    0.014769730168391155
And the probability that they are a Democrat:
    prob(democrat)

    0.3662609048488537
Now we can compute the probability that a respondent is a banker and a Democrat:
    prob(banker & democrat)

    0.004686548995739501
As we should expect, prob(banker & democrat) is less than prob(banker), because not all bankers are Democrats.
We expect conjunction to be commutative; that is, A & B should be the same as B & A. To check, we can also compute prob(democrat & banker):
    prob(democrat & banker)

    0.004686548995739501
As expected, they are the same.
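One practical detail when combining comparisons with &: in Python, & binds more tightly than == and <=, so each comparison needs its own parentheses. The sketch below uses a small made-up DataFrame (not GSS data) with the same column names and codes as the text.

```python
import pandas as pd

# Hypothetical coded responses for five respondents (not GSS data).
df = pd.DataFrame({
    'indus10': [6870, 6870, 4970, 6870, 2670],
    'partyid': [0, 3, 1, 1, 6],
})

# Each comparison is wrapped in parentheses because & has higher
# precedence than the comparison operators.
both = (df['indus10'] == 6870) & (df['partyid'] <= 1)

print(both.tolist())  # [True, False, False, True, False]
print(both.mean())    # 0.4
```

Storing each comparison in a named Series first, as the chapter does with banker and democrat, avoids the issue entirely.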
Conditional Probability
Conditional probability is a probability that depends on a condition, but that might not be the most helpful definition. Here are some examples:
- What is the probability that a respondent is a Democrat, given that they are liberal?
- What is the probability that a respondent is female, given that they are a banker?
- What is the probability that a respondent is liberal, given that they are female?
Let’s start with the first one, which we can interpret like this: “Of all the respondents who are liberal, what fraction are Democrats?”
We can compute this probability in two steps:
- Select all respondents who are liberal.
- Compute the fraction of the selected respondents who are Democrats.
To select liberal respondents, we can use the bracket operator, [], like this:
    selected = democrat[liberal]
selected contains the values of democrat for liberal respondents, so prob(selected) is the fraction of liberals who are Democrats:
    prob(selected)

    0.5206403320240125
A little more than half of liberals are Democrats. If that result is lower than you expected, keep in mind:
- We used a somewhat strict definition of “Democrat”, excluding independents who “lean” Democratic.
- The dataset includes respondents as far back as 1974; in the early part of this interval, there was less alignment between political views and party affiliation, compared to the present.
Let’s try the second example, “What is the probability that a respondent is female, given that they are a banker?” We can interpret that to mean, “Of all respondents who are bankers, what fraction are female?”
Again, we’ll use the bracket operator to select only the bankers and prob to compute the fraction that are female:
    selected = female[banker]
    prob(selected)

    0.7706043956043956
About 77% of the bankers in this dataset are female.
Let’s wrap this computation in a function. I’ll define conditional to take two Boolean Series, proposition and given, and compute the conditional probability of proposition conditioned on given:
    def conditional(proposition, given):
        return prob(proposition[given])
We can use conditional to compute the probability that a respondent is liberal given that they are female:
    conditional(liberal, given=female)

    0.27581004111500884
About 28% of female respondents are liberal.
I included the keyword, given, along with the argument, female, to make this expression more readable.
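Here is a self-contained sketch of the same two steps on a tiny made-up example, so the selection is easy to follow by eye. The six respondents are invented for illustration (not GSS data); prob and conditional are repeated from the text so the block runs on its own.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Probability of proposition among the rows where given is True."""
    return prob(proposition[given])

# Hypothetical respondents (not GSS data).
liberal = pd.Series([True, True, True, False, False, False])
democrat = pd.Series([True, True, False, True, False, False])

# Step 1: keep only the rows where liberal is True.
selected = democrat[liberal]
print(len(selected))  # 3

# Step 2: the fraction of those rows where democrat is True (2 of 3).
print(conditional(democrat, given=liberal))
```

The bracket selection matches rows by index label, so this approach assumes both Series come from the same DataFrame, as they do in this chapter.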
Conditional Probability Is Not Commutative
We have seen that conjunction is commutative; that is, prob(A & B) is always equal to prob(B & A).
But conditional probability is not commutative; that is, conditional(A, B) is not the same as conditional(B, A).
That should be clear if we look at an example. Previously, we computed the probability a respondent is female, given that they are a banker.
    conditional(female, given=banker)

    0.7706043956043956
The result shows that the majority of bankers are female. That is not the same as the probability that a respondent is a banker, given that they are female:
    conditional(banker, given=female)

    0.02116102749801969
Only about 2% of female respondents are bankers.
I hope this example makes it clear that conditional probability is not commutative, and maybe it was already clear to you. Nevertheless, it is a common error to confuse conditional(A, B) and conditional(B, A).
We’ll see some examples later.
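One way to see where the asymmetry comes from is to look at the underlying counts: both conditional probabilities share a numerator but divide by different groups. The counts below are inferred from the outputs printed in this chapter, so treat them as an assumption rather than values read from the dataset.

```python
# Counts inferred from the outputs printed in this chapter; they are
# an assumption for illustration, not values read from the dataset.
n_female_bankers = 561
n_bankers = 728
n_female = 26511

# Same numerator, different denominators.
p_female_given_banker = n_female_bankers / n_bankers
p_banker_given_female = n_female_bankers / n_female

print(round(p_female_given_banker, 4))  # 0.7706
print(round(p_banker_given_female, 4))  # 0.0212
```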
Condition and Conjunction
We can combine conditional probability and conjunction. For example, here’s the probability a respondent is female, given that they are a liberal Democrat:
    conditional(female, given=liberal & democrat)

    0.576085409252669
About 58% of liberal Democrats are female.
And here’s the probability they are a liberal female, given that they are a banker:
    conditional(liberal & female, given=banker)

    0.17307692307692307
About 17% of bankers are liberal women.
Laws of Probability
In the next few sections, we’ll derive three relationships between conjunction and conditional probability:
- Theorem 1: Using a conjunction to compute a conditional probability.
- Theorem 2: Using a conditional probability to compute a conjunction.
- Theorem 3: Using conditional(A, B) to compute conditional(B, A).
Theorem 3 is also known as Bayes’s theorem.
I’ll write these theorems using mathematical notation for probability:
- $P(A \text{ and } B)$ is the probability of the conjunction of $A$ and $B$, that is, the probability that both are true.
- $P(A \mid B)$ is the conditional probability of $A$ given that $B$ is true. The vertical line between $A$ and $B$ is pronounced “given”.
With that, we are ready for Theorem 1.
Theorem 1
What fraction of bankers are female? We have already seen one way to compute the answer:
- Use the bracket operator to select the bankers, then
- Use mean to compute the fraction of bankers who are female.
We can write these steps like this:
    female[banker].mean()

    0.7706043956043956
Or we can use the conditional function, which does the same thing:
    conditional(female, given=banker)

    0.7706043956043956
But there is another way to compute this conditional probability, by computing the ratio of two probabilities:
- The fraction of respondents who are female bankers, and
- The fraction of respondents who are bankers.
In other words: of all the bankers, what fraction are female bankers? Here’s how we compute this ratio:
    prob(female & banker) / prob(banker)

    0.7706043956043956
The result is the same. This example demonstrates a general rule that relates conditional probability and conjunction. Here’s what it looks like in math notation:
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$
And that’s Theorem 1.
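To see why the ratio works under this chapter’s counting definition of probability, here is a short derivation. The symbols $N$ (the number of respondents) and $n(\cdot)$ (the number of respondents for whom a proposition is true) are notation assumed here for illustration, and the derivation requires $n(B) > 0$.

```latex
\begin{align*}
P(A \mid B)
  &= \frac{n(A \text{ and } B)}{n(B)}
   && \text{fraction of the selected rows} \\
  &= \frac{n(A \text{ and } B) / N}{n(B) / N}
   && \text{divide top and bottom by } N \\
  &= \frac{P(A \text{ and } B)}{P(B)}
\end{align*}
```

The first line is what the bracket operator followed by mean computes; the last line is the ratio of two probabilities.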
Theorem 2
If we start with Theorem 1 and multiply both sides by $P(B)$, we get Theorem 2:
$$P(A \text{ and } B) = P(B) \, P(A \mid B)$$
This formula suggests a second way to compute a conjunction: instead of using the & operator, we can compute the product of two probabilities.
Let’s see if it works for liberal and democrat. Here’s the result using &:
    prob(liberal & democrat)

    0.1425238385067965
And here’s the result using Theorem 2:
    prob(democrat) * conditional(liberal, democrat)

    0.1425238385067965
They are the same.
Theorem 3
We have established that conjunction is commutative. In math notation, that means:
$$P(A \text{ and } B) = P(B \text{ and } A)$$
If we apply Theorem 2 to both sides, we have:
$$P(B) \, P(A \mid B) = P(A) \, P(B \mid A)$$
Here’s one way to interpret that: if you want to check $A$ and $B$, you can do it in either order:
- You can check $B$ first, then $A$ conditioned on $B$, or
- You can check $A$ first, then $B$ conditioned on $A$.
If we divide through by $P(B)$, we get Theorem 3:
$$P(A \mid B) = \frac{P(A) \, P(B \mid A)}{P(B)}$$
And that, my friends, is Bayes’s theorem.
To see how it works, let’s compute the fraction of bankers who are liberal, first using conditional:
    conditional(liberal, given=banker)

    0.2239010989010989
Now using Bayes’s theorem:
    prob(liberal) * conditional(banker, liberal) / prob(banker)

    0.2239010989010989
They are the same.
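The agreement is not a coincidence of rounding. As a sketch, the same check can be done with exact fractions, where the common factors cancel. The counts below are inferred from the outputs printed in this chapter, so they are an assumption for illustration rather than values read from the dataset.

```python
from fractions import Fraction

# Counts inferred from outputs printed in this chapter (an assumption).
N = 49290               # respondents
n_liberal = 13493
n_banker = 728
n_liberal_banker = 163

p_liberal = Fraction(n_liberal, N)
p_banker = Fraction(n_banker, N)
p_banker_given_liberal = Fraction(n_liberal_banker, n_liberal)

# Bayes's theorem: P(A|B) = P(A) P(B|A) / P(B),
# with A = liberal and B = banker.
bayes = p_liberal * p_banker_given_liberal / p_banker

# Direct computation: the fraction of bankers who are liberal.
direct = Fraction(n_liberal_banker, n_banker)

print(bayes == direct)  # True
print(float(bayes))     # about 0.2239
```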
The Law of Total Probability
In addition to these three theorems, there’s one more thing we’ll need to do Bayesian statistics: the law of total probability. Here’s one form of the law, expressed in mathematical notation:
$$P(A) = P(B_1 \text{ and } A) + P(B_2 \text{ and } A)$$
In words, the total probability of $A$ is the sum of two possibilities: either $B_1$ and $A$ are true or $B_2$ and $A$ are true. But this law applies only if $B_1$ and $B_2$ are:
- Mutually exclusive, which means that only one of them can be true, and
- Collectively exhaustive, which means that one of them must be true.
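Both requirements can be checked mechanically for Boolean Series: exactly one condition should be True in every row. The helper below, is_mece, is a hypothetical name introduced here for illustration, and the data is made up rather than taken from the GSS.

```python
import pandas as pd

def is_mece(conditions):
    """Check that exactly one condition is True in every row."""
    # Adding the Series as integers counts how many conditions hold per row.
    counts = sum(c.astype(int) for c in conditions)
    return bool((counts == 1).all())

# Hypothetical coded column with values 1 and 2 only (not GSS data).
sex = pd.Series([1, 2, 2, 1, 2])
male = (sex == 1)
female = (sex == 2)

print(is_mece([male, female]))            # True
print(is_mece([male, female, sex >= 1]))  # False: conditions overlap
print(is_mece([male]))                    # False: not exhaustive
```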
As an example, let’s use this law to compute the probability that a respondent is a banker. We can compute it directly like this:
    prob(banker)

    0.014769730168391155
So let’s confirm that we get the same thing if we compute male and female bankers separately.
In this dataset all respondents are designated male or female. Recently, the GSS Board of Overseers announced that they will add more inclusive gender questions to the survey (you can read more about this issue, and their decision, at https://oreil.ly/onK2P).
We already have a Boolean Series that is True for female respondents. Here’s the complementary Series for male respondents:
    male = (gss['sex'] == 1)
Now we can compute the total probability of banker like this:
    prob(male & banker) + prob(female & banker)

    0.014769730168391155
Because male and female are mutually exclusive and collectively exhaustive (MECE), we get the same result we got by computing the probability of banker directly.
Applying Theorem 2, we can also write the law of total probability like this:
$$P(A) = P(B_1) \, P(A \mid B_1) + P(B_2) \, P(A \mid B_2)$$
And we can test it with the same example:
    (prob(male) * conditional(banker, given=male) +
     prob(female) * conditional(banker, given=female))

    0.014769730168391153
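This result differs from prob(banker) in the last digit. That is floating-point rounding, not a failure of the law: the two expressions perform different arithmetic, so their rounding errors differ. A minimal sketch of a tolerant comparison, using the two values as printed in this chapter and the standard library’s math.isclose:

```python
import math

direct = 0.014769730168391155    # prob(banker), as printed in the text
by_parts = 0.014769730168391153  # total probability, as printed in the text

# Exact equality fails because the last digit differs.
print(direct == by_parts)              # False

# Comparing within a relative tolerance succeeds.
print(math.isclose(direct, by_parts))  # True
```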
When there are more than two conditions, it is more concise to write the law of total probability as a summation:
$$P(A) = \sum_i P(B_i) \, P(A \mid B_i)$$
Again, this holds as long as the conditions are mutually exclusive and collectively exhaustive. As an example, let’s consider polviews, which has seven different values:
    B = gss['polviews']
    B.value_counts().sort_index()

    1.0     1442
    2.0     5808
    3.0     6243
    4.0    18943
    5.0     7940
    6.0     7319
    7.0     1595
    Name: polviews, dtype: int64
On this scale, 4.0 represents “Moderate”. So we can compute the probability of a moderate banker like this:
    i = 4
    prob(B == i) * conditional(banker, B == i)

    0.005822682085615744
And we can use sum and a generator expression to compute the summation:
    sum(prob(B == i) * conditional(banker, B == i)
        for i in range(1, 8))

    0.014769730168391157
The result is the same.
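The same summation can be wrapped in a small function that loops over whatever values the conditioning column takes. This is a sketch on made-up data, not the GSS; total_probability is a hypothetical helper name, and it assumes B has no missing values, since a missing value would produce an empty selection.

```python
import math
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Probability of proposition among the rows where given is True."""
    return prob(proposition[given])

def total_probability(A, B):
    """Sum P(B == v) * P(A | B == v) over the distinct values of B."""
    # Assumes B has no missing values.
    return sum(prob(B == v) * conditional(A, B == v)
               for v in B.unique())

# Hypothetical data: B is a coded column, A is a Boolean proposition.
B = pd.Series([1, 1, 2, 2, 2, 3, 3, 3])
A = pd.Series([True, False, True, True, False, False, False, True])

total = total_probability(A, B)
print(math.isclose(total, prob(A)))  # True
```

The distinct values of a single column are mutually exclusive and collectively exhaustive by construction, which is why the sum recovers prob(A).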
In this example, using the law of total probability is a lot more work than computing the probability directly, but it will turn out to be useful, I promise.
Summary
Here’s what we have so far:
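Theorem 1 gives us a way to compute a conditional probability using a conjunction:
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$
Theorem 2 gives us a way to compute a conjunction using a conditional probability:
$$P(A \text{ and } B) = P(B) \, P(A \mid B)$$
Theorem 3, also known as Bayes’s theorem, gives us a way to get from $P(B \mid A)$ to $P(A \mid B)$, or the other way around:
$$P(A \mid B) = \frac{P(A) \, P(B \mid A)}{P(B)}$$
The Law of Total Probability provides a way to compute probabilities by adding up the pieces:
$$P(A) = \sum_i P(B_i) \, P(A \mid B_i)$$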
At this point you might ask, “So what?” If we have all of the data, we can compute any probability we want, any conjunction, or any conditional probability, just by counting. We don’t have to use these formulas.
And you are right, if we have all of the data. But often we don’t, and in that case, these formulas can be pretty useful—especially Bayes’s theorem. In the next chapter, we’ll see how.
Exercises
Exercise 1-1.
Let’s use the tools in this chapter to solve a variation of the Linda problem.
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
Linda is a banker.
Linda is a banker and considers herself a liberal Democrat.
To answer this question, compute:
- The probability that Linda is a female banker,
- The probability that Linda is a liberal female banker, and
- The probability that Linda is a liberal female banker and a Democrat.
Exercise 1-2.
Use conditional to compute the following probabilities:
- What is the probability that a respondent is liberal, given that they are a Democrat?
- What is the probability that a respondent is a Democrat, given that they are liberal?
Think carefully about the order of the arguments you pass to conditional.
Exercise 1-3.
There’s a famous quote about young people, old people, liberals, and conservatives that goes something like:
If you are not a liberal at 25, you have no heart. If you are not a conservative at 35, you have no brain.
Whether you agree with this proposition or not, it suggests some probabilities we can compute as an exercise. Rather than use the specific ages 25 and 35, let’s define young and old as under 30 or 65 and over:
    young = (gss['age'] < 30)
    prob(young)

    0.19435991073240008
    old = (gss['age'] >= 65)
    prob(old)

    0.17328058429701765
For these thresholds, I chose round numbers near the 20th and 80th percentiles. Depending on your age, you may or may not agree with these definitions of “young” and “old”.
I’ll define conservative as someone whose political views are “Conservative”, “Slightly Conservative”, or “Extremely Conservative”.
    conservative = (gss['polviews'] >= 5)
    prob(conservative)

    0.3419354838709677
Use prob and conditional to compute the following probabilities:
- What is the probability that a randomly chosen respondent is a young liberal?
- What is the probability that a young person is liberal?
- What fraction of respondents are old conservatives?
- What fraction of conservatives are old?
For each statement, think about whether it is expressing a conjunction, a conditional probability, or both.
For the conditional probabilities, be careful about the order of the arguments. If your answer to the last question is greater than 30%, you have it backwards!