
The Seam of Our Part-of-Speech Tagger: CorpusParser
The seam of a part-of-speech tagger is how you feed it data. The most important
point is that the tagger can only learn from a data source that carries the right
information. First we need to make some assumptions about how we want it to work.
We want to store each transition from a word/tag combination in a two-element
array and then wrap that array in a simple class called CorpusParser::TagWord.
Our initial test looks like ...
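
Separately from the book's own test, the storage idea described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the author's implementation: it assumes the input is Brown-corpus-style text in which each token is written as word/tag, that TagWord is a Struct holding the word and its tag, and that a sentence begins with a START sentinel. Apart from CorpusParser::TagWord, every name here (START, parse) is hypothetical.

```ruby
# Hypothetical sketch: only CorpusParser::TagWord is named in the text.
class CorpusParser
  # Wraps a two-element [word, tag] pair in a small value object.
  TagWord = Struct.new(:word, :tag)

  # Assumed sentinel marking the position before the first token.
  START = TagWord.new("START", "START")

  # Yields one (previous, current) transition per "word/tag" token in a
  # whitespace-separated line. Returns an Enumerator when no block is given.
  def parse(line)
    return enum_for(:parse, line) unless block_given?

    previous = START
    line.split.each do |token|
      # Split on the last slash so a word that itself contains "/" survives.
      word, _, tag = token.rpartition("/")
      next if word.empty? || tag.empty? # skip tokens that are not word/tag

      current = TagWord.new(word, tag)
      yield previous, current
      previous = current
    end
  end
end

# Usage: walk the tag-to-tag transitions of one tagged line.
CorpusParser.new.parse("The/at dog/nn").each do |previous, current|
  puts "#{previous.tag} -> #{current.tag}"
end
```

Using a Struct keeps the pair comparable by value, which makes transitions easy to assert on in tests and to count later when estimating tag-transition frequencies.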