
150 | Chapter 8
Our code should now work. Note, however, that there is still a Unicode space in it (the no-break space), which is represented as \u00a0.
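If that no-break space turns out to be unwanted, one option is to replace it with an ordinary space before tokenizing. The following is only a minimal sketch and is not part of the Tokenizer class; the sample string and the helper name replace_nbsp are hypothetical.

```python
# The no-break space, U+00A0, written as an escape so it is visible in source.
NBSP = "\u00a0"

# Hypothetical sample text containing one no-break space.
text = "this is\u00a0a test"


def replace_nbsp(s):
    """Replace every no-break space in s with an ordinary ASCII space."""
    return s.replace(NBSP, " ")


cleaned = replace_nbsp(text)
print(NBSP in text)     # True
print(NBSP in cleaned)  # False
```

Whether to keep or replace the character depends on how the tokens are used later; the sketch only shows that the replacement is a one-line operation.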
Now we have a new problem, though: the data does not add up to 1. We will introduce a new function, normalize, which takes a dictionary (hash) of values and applies the function x / sum(x) to all of them. Note that I used Fraction, which increases the reliability of the calculations and doesn't do floating-point arithmetic until needed:
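from fractions import Fraction

class Tokenizer:
    # tokenize

    @classmethod
    def normalize(cls, dist):
        # Divide each value by the total so that the results sum to exactly 1.
        sum_values = sum(dist.values())
        return {k: Fraction(v, sum_values) for k, v in dist.items()}
    # ...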
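To see the effect of normalizing with Fraction, here is a small standalone sketch. It assumes integer counts as input and uses a free function instead of the class method; the dictionary counts is made-up illustration data.

```python
from fractions import Fraction


def normalize(dist):
    """Map each key to value / sum(values), kept as an exact Fraction."""
    total = sum(dist.values())
    return {k: Fraction(v, total) for k, v in dist.items()}


# Hypothetical token counts, for illustration only.
counts = {"spam": 1, "ham": 2, "eggs": 3}
probs = normalize(counts)

print(probs["spam"])        # 1/6
print(sum(probs.values()))  # 1
print(float(probs["ham"]))  # converted to a float only at the point of use
```

Because every value stays an exact ratio, the normalized values add up to exactly 1, and rounding error appears only when a value is finally converted with float().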