Python文本分析
by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
Preparing Textual Data for Statistics and Machine Learning
4.4.3 Case Study: Normalizing Characters with textacy
Consider the following sentence, which contains typical problems caused by letter variants and quotation characters:
text = "The café “Saint-Raphaël” is loca-\nted on Côte d’Azur."
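Accented characters can cause problems because people do not use accents consistently. For example, the tokens “Saint-Raphaël” and “Saint-Raphael” will not be recognized as the same word. In addition, words in a text are often split with a hyphen because of automatic line breaks. The many Unicode hyphens and apostrophes that appear in text (such as the hyphen and apostrophe in the sentence above) can also cause trouble for tokenization. We should normalize the text for all of these issues and replace accented letters and typographic character variants with their ASCII equivalents.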
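The three normalization steps described above can be sketched with the Python standard library alone. This is only an illustration of the idea, not textacy's implementation: the helper names (`rejoin_hyphenated`, `strip_accents`, `normalize`) and the small quote table are assumptions made for this example, and the curly quotes and apostrophe in the sample string are assumed as well.

```python
import re
import unicodedata

# Sample sentence; the exact typographic quotes and apostrophe are assumed.
text = "The café “Saint-Raphaël” is loca-\nted on Côte d’Azur."

# Map a few typographic quotes to ASCII (illustrative, far from exhaustive).
QUOTES = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'", "ʼ": "'"})


def rejoin_hyphenated(s: str) -> str:
    """Join words that were split by a hyphen at a line break."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", s)


def strip_accents(s: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(s: str) -> str:
    """Apply the three steps in order: hyphenation, quotes, accents."""
    s = rejoin_hyphenated(s)
    s = s.translate(QUOTES)
    return strip_accents(s)


print(normalize(text))
# The cafe "Saint-Raphael" is located on Cote d'Azur.
```

The hyphen in “Saint-Raphaël” survives because the pattern only matches a hyphen that is directly followed by a line break. Stripping accents this way is lossy, so it suits vocabulary matching better than text that will be shown to readers.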
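For this we will use textacy (https://textacy.readthedocs.io). textacy is an NLP library designed to work together with spaCy. spaCy takes care of the linguistic processing, while textacy focuses on pre- and postprocessing.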
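textacy's preprocessing module contains a collection of useful functions for normalizing characters and for handling common patterns such as URLs, email addresses, and telephone numbers, and we will use these functions next. Table 4-1 lists textacy's preprocessing functions. All of them operate on plain text and do not depend on spaCy at all.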
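To make the idea of handling common patterns on plain text concrete, here is a minimal standard-library sketch that masks URLs and email addresses. It is not how textacy does it: the regular expressions are deliberately simplified, and the function name `mask_patterns` and the placeholder tokens `_URL_` and `_EMAIL_` are hypothetical choices for this example.

```python
import re

# Deliberately simple patterns for illustration; real-world URLs and
# email addresses have many more edge cases than these cover.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_patterns(s: str) -> str:
    """Replace URLs and email addresses with placeholder tokens."""
    s = URL_RE.sub("_URL_", s)
    return EMAIL_RE.sub("_EMAIL_", s)


sample = "Docs: https://textacy.readthedocs.io or mail jane.doe@example.com today"
print(mask_patterns(sample))
# Docs: _URL_ or mail _EMAIL_ today
```

Like the functions discussed here, this works on a plain string and needs no language model. Replacing each pattern with a fixed token keeps the information that a URL or address was present without adding thousands of one-off entries to the vocabulary.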
Table 4-1. textacy's preprocessing functions
Function                      Description
normalize_hyphenated_words    Reassembles words that were split by a line break
normalize_quotation_marks ...

Publisher Resources

ISBN: 9787519864446