
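With n-grams available, text can be represented using n-grams rather than individual words. Raw n-grams, however, yield a great many candidates, and most of them are not meaningful. Take the sentence "I got lost in the corn maze during the fall picnic". It contains the trigram ('in', 'the', 'corn'), which is not a typical prepositional phrase, whereas other trigrams (for example, ('I', 'got', 'lost')) make sense on their own.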

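Given the token 'during', how likely is it that the tokens ('the', 'fall') follow? This likelihood can be estimated empirically: for a given n-gram, compute the frequency of its (n-1)-gram conditioned on the first token of the n-gram. With this technique, n-grams whose words are often used together, such as ('corn', 'maze'), are valued more highly than rarer, less meaningful combinations. In other words, some n-grams are more valuable than other n-grams.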
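The conditional estimate described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `conditional_likelihood` is hypothetical, and the second sentence of the toy text is made up so that 'during' occurs more than once.

```python
from collections import Counter

def conditional_likelihood(tokens, ngram):
    """
    Estimate P(ngram[1:] | ngram[0]): of all n-length windows that start
    with the first token, the fraction that continue with the other tokens.
    (Hypothetical helper, not part of NLTK.)
    """
    n = len(ngram)
    windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(windows)
    # Number of windows that begin with the conditioning token.
    starts = sum(c for w, c in counts.items() if w[0] == ngram[0])
    return counts[tuple(ngram)] / starts if starts else 0.0

# Toy text: the first sentence is the example above, the second is invented.
tokens = (
    "I got lost in the corn maze during the fall picnic . "
    "We ate corn during the fall festival and during the storm ."
).split()

p_fall = conditional_likelihood(tokens, ("during", "the", "fall"))
p_storm = conditional_likelihood(tokens, ("during", "the", "storm"))
```

In this toy text 'during' starts three trigrams, two of which continue with ('the', 'fall'), so `p_fall` is larger than `p_storm`.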

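NLTK provides two tools for discovering significant collocations: CollocationFinder, which finds and ranks n-gram collocations, and NgramAssocMeasures, which contains a collection of metrics for scoring the significance of a collocation. Both depend on the size of n. The module ships with bigram, trigram, and quadgram ranking utilities; associations for 5-grams and above are not provided and must be implemented by hand. The listing below ranks the quadgrams of a corpus with a given association metric.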
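For n larger than 4, a rough stand-in is to count n-grams with a sliding window. The sketch below does not use or subclass NLTK's collocation classes; the function name `rank_ngrams_by_frequency` and the toy text are assumptions made for illustration, and raw relative frequency stands in for a proper association metric.

```python
from collections import Counter

def rank_ngrams_by_frequency(tokens, n=5):
    """Return (ngram, relative frequency) pairs, most frequent first."""
    # Slide a window of length n across the token list.
    windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(windows)
    counts = Counter(windows)
    return [(ngram, count / total) for ngram, count in counts.most_common()]

# Invented toy text in which a few 5-grams repeat.
tokens = ("once upon a time there was a maze . "
          "once upon a time there was a picnic .").split()
ranked = rank_ngrams_by_frequency(tokens, n=5)
```

Raw frequency favors common function-word sequences, which is why the association metrics discussed in this section are usually preferred for ranking.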

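from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

def rank_quadgrams(corpus, metric, path=None):
    """
    Find and rank quadgrams from the supplied corpus using the given
    association metric. Write the quadgrams to the given path if one is
    supplied; otherwise return the ranked list in memory.
    """
    # Create a collocation ranking utility from corpus words.
    ngrams = QuadgramCollocationFinder.from_words(corpus.words())

    # Rank collocations by an association metric
    scored = ngrams.score_ngrams(metric)

    if path:
        # Write to disk as a tab-delimited file
        with open(path, 'w') as f:
            f.write("Collocation\tScore ({})\n".format(metric.__name__))
            for ngram, score in scored:
                f.write("{}\t{}\n".format(repr(ngram), score))
    else:
        return scored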

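Calling the function with the likelihood_ratio metric and an output path writes a file that begins as follows: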

Collocation Score (likelihood_ratio)
('New', 'York', "'", 's') 156602.26742890902
('pictures', 'of', 'the', 'Earth') 28262.697780596758
('the', 'majority', 'of', 'users') 28262.36608379526
('numbed', 'by', 'the', 'mindlessness') 3091.139615301832
('There', 'was', 'a', 'time') 3090.2332736791095

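NLTK exposes a variety of association measures, including Student's t-test, Pearson's chi-square test, the Poisson-Stirling measure, and the Jaccard index. Bigram associations include several additional measures, such as phi-square (the square of the Pearson correlation), Fisher's exact test, and Dice's coefficient.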
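To make two of these measures concrete, the sketch below computes Dice's coefficient and the Jaccard index for a single bigram directly from counts. It is a plain-Python illustration, not NLTK's implementation: the helper `bigram_association` and the toy text are hypothetical, and the marginal counts are taken over bigram positions.

```python
def bigram_association(tokens, w1, w2):
    """Return (dice, jaccard) for the bigram (w1, w2)."""
    bigrams = list(zip(tokens, tokens[1:]))
    n_ii = bigrams.count((w1, w2))                 # w1 followed by w2
    n_ix = sum(1 for a, _ in bigrams if a == w1)   # w1 in first position
    n_xi = sum(1 for _, b in bigrams if b == w2)   # w2 in second position
    if n_ii == 0:
        return 0.0, 0.0
    # Dice: twice the joint count over the sum of the marginals.
    dice = 2 * n_ii / (n_ix + n_xi)
    # Jaccard: joint count over the bigrams containing w1 first or w2 second.
    jaccard = n_ii / (n_ix + n_xi - n_ii)
    return dice, jaccard

# Invented toy text.
tokens = "the corn maze and the corn field near the hay maze".split()
dice, jaccard = bigram_association(tokens, "corn", "maze")
```

Both scores approach 1.0 when the two words almost always occur together and fall toward 0.0 when each mostly appears with other neighbors.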


Figure 7-2: An n-gram feature extraction pipeline (diagram label: "N-Gram aware")

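The transformer's fit() method finds and ranks the significant collocations, and transform() then produces, for each document, a vector that encodes the scores of the significant collocations found in it. These features can be joined to other vectors with a FeatureUnion.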

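from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures
from sklearn.base import BaseEstimator, TransformerMixin

class SignificantCollocations(BaseEstimator, TransformerMixin):

    def __init__(self,
                 ngram_class=QuadgramCollocationFinder,
                 metric=QuadgramAssocMeasures.pmi):
        self.ngram_class = ngram_class
        self.metric = metric

    def fit(self, docs, target=None):
        # Score every collocation found in the training documents.
        ngrams = self.ngram_class.from_documents(docs)
        self.scored_ = dict(ngrams.score_ngrams(self.metric))
        return self

    def transform(self, docs):
        for doc in docs:
            # Find the n-grams of this single document.
            ngrams = self.ngram_class.from_words(doc)
            yield {
                ngram: self.scored_.get(ngram, 0.0)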