Naive Bayesian Classification
first place. Assuming that it takes up to 60 seconds per order to determine whether
it's fraudulent or not, and a customer service representative costs around $15 per
hour to hire, that totals 200 hours and $3,000 per year.
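The arithmetic behind those totals can be checked with a short sketch. The 60 seconds, $15 per hour, and 200 hours come from the text; the order count is not stated here and is only what those figures imply if every review takes the full 60 seconds. All names are illustrative.

```python
# Figures quoted in the text.
SECONDS_PER_ORDER = 60        # upper bound on review time per order
HOURLY_RATE_USD = 15          # cost of a customer service representative
REVIEW_HOURS_PER_YEAR = 200   # total review time per year

# Annual cost of manual review.
annual_cost = REVIEW_HOURS_PER_YEAR * HOURLY_RATE_USD

# Assumption: every review takes the full 60 seconds, so this is the
# number of orders that 200 hours of review would cover.
orders_reviewed = REVIEW_HOURS_PER_YEAR * 3600 // SECONDS_PER_ORDER

print(annual_cost)      # 3000
print(orders_reviewed)  # 12000
```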
Another way of approaching this problem would be to review only the orders whose
probability of being fraudulent is over 50%. In this case, we'd expect the number of
orders we'd have to look at to be much lower. But this is where things become
difficult, because the only thing we can determine is the probability that an order
is fraudulent, which is 10%. Given that piece of information, we'd be back at square
one looking at all orders, because it's more probable that an order is not fraudulent!
Let's say that we notice that fraudulent orders often use gift cards and multiple
promotional codes. Using this knowledge, how would we determine what is fraudulent
or not? That is, how would we calculate the probability of fraud given that the
purchaser used a gift card?
To answer that, we first have to talk about conditional probabilities.
Conditional Probabilities
Most people understand what we mean by the probability of something happening.
For instance, the probability of an order being fraudulent is 10%. That's pretty
straightforward. But what about the probability of an order being fraudulent given
that it used a gift card? To handle that more complicated case, we need something
called a conditional probability, which is defined as follows:
Equation 4-1. Conditional probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Probability Symbols
Generally speaking, writing P(E) means that you are looking at the probability of a
given event. This event can be a lot of different things, including the event that A
and B both happened, the event that A or B happened, or the event that A happened
given that B happened in the past. Here we'll cover how you'd notate each of these
scenarios.
A ∩ B is called the intersection of A and B, and can also be thought of as the
Boolean operation AND. For instance, in Python it looks like this:
a = [1,2,3]
b = [1,4,5]
set(a) & set(b)  # => {1}
A ∪ B is called the union of A and B, and can be thought of as the OR function,
since it contains everything that is in A, in B, or in both. For instance, in Python
it looks like the following:
a = [1,2,3]
b = [1,4,5]
set(a) | set(b)  # => {1, 2, 3, 4, 5}
Finally, the probability of A given B looks as follows in Python:
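a = set([1,2,3])
b = set([1,4,5])
total = 6.0  # number of possible outcomes (cancels out in the final ratio)
p_a_cap_b = len(a & b) / total  # P(A and B): outcomes in both A and B
p_b = len(b) / total  # P(B)
p_a_given_b = p_a_cap_b / p_b  # => 0.333...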
This definition basically says that the probability of A happening given that B
happened is the probability of A and B happening divided by the probability of B.
Graphically, it looks something like Figure 4-1.
Figure 4-1. How conditional probabilities are made
This shows how P(A | B) sits between P(A and B) and P(B).
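The same definition can be wrapped in a small reusable function. What follows is a minimal sketch, not code from the book: the function names are illustrative, and it assumes a finite sample space of equally likely outcomes (here a hypothetical six-sided die, chosen to match the total of 6 used earlier). It uses `fractions.Fraction` so the result is exact rather than a rounded float.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(event), assuming every outcome in sample_space is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional_probability(a, b, sample_space):
    """P(A | B) = P(A and B) / P(B), following Equation 4-1."""
    p_b = probability(b, sample_space)
    if p_b == 0:
        raise ValueError("P(B) is zero, so P(A | B) is undefined")
    return probability(a & b, sample_space) / p_b


# Assumption: six equally likely outcomes, like the faces of a die.
sample_space = {1, 2, 3, 4, 5, 6}
a = {1, 2, 3}
b = {1, 4, 5}

p_a_given_b = conditional_probability(a, b, sample_space)
print(p_a_given_b)  # 1/3
```

Guarding against P(B) = 0 matters because the ratio is undefined when the conditioning event can never occur.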
In our fraud example, let’s say we want to measure the probability of fraud given that
an order used a gift card. This would be:
$$P(\text{Fraud} \mid \text{Giftcard}) = \frac{P(\text{Fraud} \cap \text{Giftcard})}{P(\text{Giftcard})}$$
Now this works if you know the actual probability of Fraud and Giftcard.
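To make the formula concrete, here is a sketch of the calculation when the joint probability is known. Every count below is hypothetical and chosen only for illustration; none comes from the book's data.

```python
from fractions import Fraction

# Hypothetical counts, for illustration only.
total_orders = 1000
giftcard_orders = 80          # orders paid with a gift card
fraud_and_giftcard = 60       # orders that are fraudulent AND used a gift card

p_giftcard = Fraction(giftcard_orders, total_orders)
p_fraud_and_giftcard = Fraction(fraud_and_giftcard, total_orders)

# P(Fraud | Giftcard) = P(Fraud and Giftcard) / P(Giftcard)
p_fraud_given_giftcard = p_fraud_and_giftcard / p_giftcard

print(p_fraud_given_giftcard)         # 3/4
print(float(p_fraud_given_giftcard))  # 0.75
```

Under these made-up numbers, knowing that an order used a gift card would raise the estimated probability of fraud well above the 10% base rate, which is the kind of separation the 50% threshold needs.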
At this point, we are up against the problem that we cannot calculate
P(Fraud | Giftcard) because that is hard to separate out. To solve this problem, we
need to use a trick introduced by Bayes.
