Python文本分析
by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
12.5.5 Name Normalization
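Although our name resolution unifies the mentions of a company within an article, the company name is still inconsistent across articles. One article uses “Hughes Tool Co”, while another uses “Hughes Tool”. An entity linker could link the different mentions to a single canonical representation, but since we have no entity linker, we will use the (resolved) named entity as the unique identifier. Because coreference resolution always resolves to the first mention of a name, which is usually also the most complete one in the article, the potential for errors is not very large.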
In addition, we need to remove legal suffixes such as “Co.” or “Inc.” from company names. The following function uses a regular expression to do this:
import re

def strip_legal_suffix(text):
    # Strip trailing legal suffixes, an optional preceding "and", and whitespace.
    return re.sub(r'(\s+and)?(\s+|\b(Co|Corp|Inc|Plc|Ltd)\b\.?)*$', '', text)

print(strip_legal_suffix('Hughes Tool Co'))
Output:
Hughes Tool
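To see how the pattern behaves beyond the single case above, the following minimal sketch runs the same function over a few sample strings. The sample names are hypothetical and chosen only to exercise the regular expression; they are not taken from the dataset.

```python
import re

def strip_legal_suffix(text):
    # Same pattern as in the text: strip trailing legal suffixes,
    # an optional preceding "and", and trailing whitespace.
    return re.sub(r'(\s+and)?(\s+|\b(Co|Corp|Inc|Plc|Ltd)\b\.?)*$', '', text)

# Hypothetical sample mentions (illustration only).
samples = [
    'Hughes Tool Co',    # suffix without a period
    'Hughes Tool Co.',   # suffix with a period
    'Hughes Tool',       # nothing to strip
    'Smith and Co.',     # the dangling "and" is removed as well
    'Inc.',              # a name that consists only of a suffix
    'HUGHES TOOL CO',    # upper-case suffix: the pattern is case-sensitive
]

normalized = {s: strip_legal_suffix(s) for s in samples}
for original, result in normalized.items():
    print(repr(original), '->', repr(result))
```

Two things stand out. A mention that consists only of a suffix is reduced to the empty string, so downstream code may need to handle empty names. The pattern is also case-sensitive, so upper-case variants such as 'HUGHES TOOL CO' pass through unchanged unless `re.IGNORECASE` is added.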
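The last function in the pipeline, norm_names, applies this final normalization to every coreference-resolved organization name stored in the ref_n attribute. Note that with this approach, “Hughes” (PERSON) and “Hughes” (ORG) still remain separate entities.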
def norm_names(doc):
    for t in doc:
        if t._.ref_n != '' and t._.ref_t in ['ORG']:
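The excerpt ends before the loop body is shown. The following is a minimal sketch of how such a function could be completed, not the authors' implementation. It assumes that the body overwrites ref_n with the suffix-free name and clears ref_t when nothing remains. To keep it runnable without spaCy, the tokens are hypothetical stand-ins built with `SimpleNamespace` that only mimic the `t._.ref_n` and `t._.ref_t` extension attributes.

```python
import re
from types import SimpleNamespace

def strip_legal_suffix(text):
    return re.sub(r'(\s+and)?(\s+|\b(Co|Corp|Inc|Plc|Ltd)\b\.?)*$', '', text)

def norm_names(doc):
    for t in doc:
        if t._.ref_n != '' and t._.ref_t in ['ORG']:
            # Assumption: replace the resolved name with its normalized form.
            t._.ref_n = strip_legal_suffix(t._.ref_n)
            # Assumption: if only a suffix was present, drop the reference type too.
            if t._.ref_n == '':
                t._.ref_t = ''
    return doc

def make_token(text, ref_n='', ref_t=''):
    # Hypothetical stand-in for a spaCy token with custom extension attributes.
    return SimpleNamespace(text=text, _=SimpleNamespace(ref_n=ref_n, ref_t=ref_t))

doc = [
    make_token('Hughes', ref_n='Hughes Tool Co', ref_t='ORG'),
    make_token('Hughes', ref_n='Hughes', ref_t='PERSON'),
    make_token('Inc.', ref_n='Inc.', ref_t='ORG'),
    make_token('said'),
]

doc = norm_names(doc)
entities = [(t._.ref_n, t._.ref_t) for t in doc]
print(entities)
```

Because only tokens of type ORG are touched, the person “Hughes” keeps its name and type, and the pair of name and type still tells it apart from an organization with the same name.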

Publisher Resources

ISBN: 9787519864446