first place. Assuming that it takes up to 60 seconds per order to determine whether
it's fraudulent or not, and a customer service representative costs around \$15 per
hour to hire, that totals 200 hours and \$3,000 per year.
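The arithmetic behind those figures can be checked with a short sketch. The yearly order volume is not stated in this passage; the value below is back-derived from the 200-hour total and should be read as an assumption, as should all the variable names.

```python
# Figures taken from the text
SECONDS_PER_ORDER = 60   # worst-case review time per order
HOURLY_WAGE = 15.0       # dollars per hour for a representative
TOTAL_HOURS = 200        # yearly review time quoted in the text

# Assumption: order volume inferred from the figures above, not stated here
orders_per_year = TOTAL_HOURS * 3600 // SECONDS_PER_ORDER

# Recompute the yearly review time and cost from that volume
review_hours = orders_per_year * SECONDS_PER_ORDER / 3600
yearly_cost = review_hours * HOURLY_WAGE

print(orders_per_year)  # 12000
print(review_hours)     # 200.0
print(yearly_cost)      # 3000.0
```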
Another way of approaching this problem would be to estimate each order's probability
of fraud and review only the orders above 50%. In this case, we'd expect the number of
orders we'd have to look at to be much lower. But this is where things become difficult,
because the only thing we can determine is the overall probability that an order is
fraudulent, which is 10%. Given only that piece of information, we'd be back at square
one, looking at all orders, because it's more probable that any given order is not fraudulent!
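To see why the base rate alone does not help, consider a minimal sketch. The order IDs and the `THRESHOLD` name are hypothetical; only the 10% and 50% figures come from the text.

```python
PRIOR_FRAUD = 0.10   # the only probability we know so far
THRESHOLD = 0.50     # flag an order only if fraud is more likely than not

# Hypothetical order IDs, for illustration only
orders = ["order-1", "order-2", "order-3"]

# Without any extra evidence, every order gets the same score
scores = {order: PRIOR_FRAUD for order in orders}
flagged = [order for order, p in scores.items() if p > THRESHOLD]

print(flagged)  # [] -- no order stands out, so none can be prioritized
```

Because every order scores the same, the threshold cannot separate suspicious orders from ordinary ones. Extra evidence about each order is needed before a rule like this becomes useful.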
Let's say that we notice that fraudulent orders often use gift cards and multiple
promotional codes. Using this knowledge, how would we determine whether an order is
fraudulent? Namely, how would we calculate the probability of fraud given that the
purchaser used a gift card?
To answer that, we first have to talk about conditional probabilities.
Conditional Probabilities
Most people understand what we mean by the probability of something happening.
For instance, the probability of an order being fraudulent is 10%. That's pretty
straightforward. But what about the probability of an order being fraudulent given
that it used a gift card? To handle that more complicated case, we need something
called a conditional probability, which is defined as follows:
Equation 4-1. Conditional probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Probability Symbols
Generally speaking, writing P(E) means that you are looking at the probability of a
given event. This event can be a lot of different things, including the event that A and
B happened, the event that A or B happened, or the event of A given B
happening in the past. Here we'll cover how you'd notate each of these scenarios.
A ∩ B is called the intersection function but could also be thought of as the Boolean
operation AND. For instance, in Python it looks like this:
a = [1,2,3]
b = [1,4,5]
set(a) & set(b) #=> {1}

A ∪ B could be called the OR function, as it is either A or B. For instance, in Python
it looks like the following:
a = [1,2,3]
b = [1,4,5]
set(a) | set(b) #=> {1, 2, 3, 4, 5}
Finally, the probability of A given B looks as follows in Python:
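a = {1, 2, 3}
b = {1, 4, 5}
total = 6.0  # number of equally likely outcomes in the sample space
p_a_cap_b = len(a & b) / total  # P(A ∩ B): outcomes in both A and B
p_b = len(b) / total  # P(B)
p_a_given_b = p_a_cap_b / p_b #=> 0.33 (approximately)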
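The same calculation can be wrapped in a small helper. This is a sketch rather than code from the book: the function name `conditional_probability` and the explicit `sample_space` argument are assumptions, and it presumes every outcome is equally likely. Using `Fraction` keeps the result exact instead of rounding to 0.33.

```python
from fractions import Fraction

def conditional_probability(a, b, sample_space):
    """Return P(A | B) for events a and b over equally likely outcomes."""
    # Keep only outcomes that actually belong to the sample space
    a, b = set(a) & set(sample_space), set(b) & set(sample_space)
    if not b:
        raise ValueError("P(B) is zero, so P(A | B) is undefined")
    total = len(sample_space)
    p_a_cap_b = Fraction(len(a & b), total)
    p_b = Fraction(len(b), total)
    return p_a_cap_b / p_b

# Assumption: a six-outcome sample space, matching total = 6.0 above
sample_space = {1, 2, 3, 4, 5, 6}
result = conditional_probability({1, 2, 3}, {1, 4, 5}, sample_space)
print(result)         # 1/3
print(float(result))  # 0.3333333333333333
```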
The definition in Equation 4-1 basically says that the probability of A happening given
that B happened is the probability of A and B both happening divided by the probability
of B. Graphically, it looks something like Figure 4-1.
Figure 4-1. How conditional probabilities are made
This shows how P(A | B) sits between P(A and B) and P(B).
In our fraud example, let’s say we want to measure the probability of fraud given that
an order used a gift card. This would be:
$$P(\text{Fraud} \mid \text{Giftcard}) = \frac{P(\text{Fraud} \cap \text{Giftcard})}{P(\text{Giftcard})}$$
Now this works if you know the actual probability of Fraud and Giftcard happening together.
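If that joint probability were known, the calculation would be direct. The counts below are invented purely for illustration and are not data from the text; the variable names are likewise hypothetical.

```python
# Hypothetical counts, invented for illustration (not data from the text)
total_orders = 1000
giftcard_orders = 200            # orders paid with a gift card
fraud_and_giftcard_orders = 60   # orders that are fraudulent AND used a gift card

p_giftcard = giftcard_orders / total_orders
p_fraud_and_giftcard = fraud_and_giftcard_orders / total_orders

# Equation 4-1 applied to the fraud example
p_fraud_given_giftcard = p_fraud_and_giftcard / p_giftcard
print(round(p_fraud_given_giftcard, 2))  # 0.3
```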
At this point, we are up against the problem that we cannot calculate P(Fraud | Giftcard)
directly, because the joint probability of fraud and gift card use is hard to separate
out. To solve this problem, we need to use a trick introduced by Bayes.