Naive Bayesian Classification
make the subsequent tests work, though, we will have to fill in the skeleton for our
tokenizer module like so:
import re


class Tokenizer:
    # Placeholder token used to pad n-grams at the start of the text
    NULL = u'\u0000'

    @staticmethod
    def tokenize(string):
        # Lowercase the input and return its runs of word characters
        return re.findall(r"\w+", string.lower())

    @staticmethod
    def ngram(string, ngram):
        tokens = Tokenizer.tokenize(string)
        ngrams = []
        for i in range(len(tokens)):
            shift = i - ngram + 1
            padding = max(-shift, 0)
            first_idx = max(shift, 0)
            last_idx = first_idx + ngram - padding
            ngrams.append(Tokenizer.pad(tokens[first_idx:last_idx], padding))
        return ngrams

    @staticmethod
    def pad(tokens, padding):
        # Prepend `padding` NULL tokens to the token list
        padded_tokens = []
        for i in range(padding):
            padded_tokens.append(Tokenizer.NULL)
        return padded_tokens + tokens
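As a quick check of what the tokenizer produces, the following sketch repeats a condensed copy of the class (with the padding inlined) so that it runs on its own. The sample sentence is made up for illustration.

```python
import re


class Tokenizer:
    NULL = "\u0000"

    @staticmethod
    def tokenize(string):
        # Lowercase, then keep only runs of word characters
        return re.findall(r"\w+", string.lower())

    @staticmethod
    def ngram(string, ngram):
        tokens = Tokenizer.tokenize(string)
        ngrams = []
        for i in range(len(tokens)):
            shift = i - ngram + 1
            padding = max(-shift, 0)
            first_idx = max(shift, 0)
            last_idx = first_idx + ngram - padding
            window = tokens[first_idx:last_idx]
            # Left-pad windows that start before the first token
            ngrams.append([Tokenizer.NULL] * padding + window)
        return ngrams


words = Tokenizer.tokenize("Quick brown FOX!")
bigrams = Tokenizer.ngram("Quick brown FOX!", 2)
print(words)    # ['quick', 'brown', 'fox']
print(bigrams)  # [['\x00', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
```

Each token yields exactly one n-gram that ends at that token, so the first n-gram is padded on the left with the NULL placeholder.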
Now that we have a way of parsing and tokenizing emails, we can move on to build
the Bayesian portion: the SpamTrainer.
SpamTrainer
The SpamTrainer will accomplish three things:
Storing training data
Building a Bayesian classifier
Error minimization through cross-validation
Storing training data
The first step we need to tackle is to store training data from a given set of email
messages. In a production environment, you would pick a storage layer that has
persistence. In our case, we will keep everything in one big in-memory dictionary.
A set is a collection of unique items: no element appears more than once.
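To illustrate, here is a minimal sketch using Python's built-in set. The labels and file paths are sample values chosen for illustration, and the repeated "spam" label shows how duplicates collapse.

```python
# Each pair is (category label, path to an email file); values are illustrative
training = [
    ("spam", "./tests/fixtures/plain.eml"),
    ("ham", "./tests/fixtures/small.eml"),
    ("spam", "./tests/fixtures/plain.eml"),
]

# A set comprehension keeps each label only once
categories = {category for category, _path in training}
print(sorted(categories))  # ['ham', 'spam']
```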
Remember that most machine learning algorithms have two steps: training and then
computation. Our training step will consist of these substeps:
Storing a set of all categories
Storing unique word counts for each category
Storing the totals for each category
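The three substeps can be sketched as follows. This is not the book's SpamTrainer: the function name `train`, the (category, text) input format, and the reading of "totals" as the number of tokens seen per category are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def train(samples):
    """Build the three training structures from (category, text) pairs.

    Hypothetical helper: the name and input format are assumptions.
    """
    categories = set()                                   # all category names
    word_counts = defaultdict(lambda: defaultdict(int))  # category -> word -> count
    totals = defaultdict(int)                            # category -> token count
    for category, text in samples:
        categories.add(category)
        for token in re.findall(r"\w+", text.lower()):
            word_counts[category][token] += 1
            totals[category] += 1
    return categories, word_counts, totals


samples = [
    ("spam", "Buy now, buy cheap"),
    ("ham", "Lunch at noon"),
]
categories, word_counts, totals = train(samples)
print(sorted(categories))          # ['ham', 'spam']
print(word_counts["spam"]["buy"])  # 2
print(totals["spam"])              # 4
```

Nested dictionaries keep the example close to the "one big dictionary" approach. A production system would likely swap them for a persistent store.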
So first we need to capture all of the category names; that test would look something
like this:
import unittest
import io

from naive_bayes.email_object import EmailObject
from naive_bayes.spam_trainer import SpamTrainer


class TestSpamTrainer(unittest.TestCase):
    def setUp(self):
        # Each entry pairs a category label with the path to an email fixture
        self.training = [['spam', './tests/fixtures/plain.eml'],
                         ['ham', './tests/fixtures/small.eml'],
                         ['scram', './tests/fixtures/plain.eml']]
        self.trainer = SpamTrainer(self.training)
        file = io.open('./tests/fixtures/plain.eml', 'r')
        self.email = EmailObject(file)

    def test_multiple_categories(self):
        categories = self.trainer.categories
        # The built-in set replaces sets.Set, which was removed in Python 3
        expected = set([k for k, v in self.training])
        self.assertEqual(categories, expected)
The solution is in the following code:
# The Python 2 sets module no longer exists in Python 3; use the built-in set
import io
from tokenizer import Tokenizer
from email_object import EmailObject
from collections import defaultdict