['lonely city', 'heart piercing wisdom', 'loneliness', 'laing',
'everyone', 'feast later', 'point', 'own hermetic existence in new york',
'danger', 'thankfully', 'lonely city', 'cry for connection',
'overcrowded overstimulated world', 'blueprint of urban loneliness',
'emotion', 'calls', 'city', 'npr jason heller', 'olivia laing',
'lonely city', 'exploration of loneliness',
'others experiences in new york city', 'rumpus', 'review', 'lonely city',
'related posts']
In Chapter 12, we will reuse this class with a different GRAMMAR to build a custom bag-of-keyphrases transformer for a neural network sentiment classifier.
Entity Extraction
Similar to the KeyphraseExtractor, we can create a custom feature extractor that converts documents into bags-of-entities. We use NLTK's named entity recognition utility, ne_chunk, which produces a nested parse tree structure containing the syntactic categories as well as the part-of-speech tags in each sentence.

First, we create an EntityExtractor class initialized with a set of entity labels. We then add a get_entities method, which uses ne_chunk to obtain a syntactic parse tree for a given document. The method navigates the subtrees of the parse tree, extracting entities whose labels match our set (the names of people, organizations, facilities, geopolitical entities, and geosocial political entities). These are appended to an entity list, which the method returns after it has traversed all of the trees in the document:
from nltk import ne_chunk
from sklearn.base import BaseEstimator, TransformerMixin

# Entity labels to keep: people, organizations, facilities,
# geopolitical entities, and geosocial political entities.
GOODLABELS = frozenset(['PERSON', 'ORGANIZATION', 'FACILITY', 'GPE', 'GSP'])

class EntityExtractor(BaseEstimator, TransformerMixin):

    def __init__(self, labels=GOODLABELS, **kwargs):
        self.labels = labels

    def get_entities(self, document):
        entities = []
        for paragraph in document:
            for sentence in paragraph:
                # ne_chunk expects a POS-tagged sentence and returns a
                # parse tree whose subtrees are labeled entity chunks.
                trees = ne_chunk(sentence)
                for tree in trees:
                    if hasattr(tree, 'label'):
                        if tree.label() in self.labels:
                            entities.append(
                                ' '.join([child[0].lower() for child in tree])
                            )
        return entities
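To see what the subtree filtering is doing, ne_chunk's output can be mimicked with a hand-built nltk.Tree: a sentence tree whose children are either plain (word, tag) tuples or labeled entity subtrees. The sketch below applies the same matching logic as get_entities; the tree is hand-constructed for illustration and is not real chunker output.

```python
from nltk import Tree

GOODLABELS = frozenset(['PERSON', 'ORGANIZATION', 'FACILITY', 'GPE', 'GSP'])

# Mimic ne_chunk output: a sentence tree mixing plain (word, tag)
# leaves with labeled entity subtrees.
sentence = Tree('S', [
    Tree('PERSON', [('Olivia', 'NNP'), ('Laing', 'NNP')]),
    ('wrote', 'VBD'),
    ('about', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    ('.', '.'),
])

entities = []
for tree in sentence:
    # Entity chunks are Tree objects with a label; plain tokens are
    # (word, tag) tuples and have no label() method.
    if hasattr(tree, 'label') and tree.label() in GOODLABELS:
        entities.append(' '.join(child[0].lower() for child in tree))

print(entities)  # ['olivia laing', 'new york']
```

The hasattr check is what lets the loop skip ordinary tagged tokens while collecting only the labeled entity chunks.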

Get 基于Python的智能文本分析 now with the O’Reilly learning platform.