Python和NLTK实现自然语言处理

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing

Chapter 2. Replacing and Correcting Words

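This chapter covers the following topics.

  • Stemming words.
  • Lemmatizing words with WordNet.
  • Replacing words that match regular expressions.
  • Removing repeating characters.
  • Spelling correction with Enchant.
  • Replacing synonyms.
  • Replacing negations with antonyms.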

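This chapter describes several techniques for replacing and correcting words. They cover linguistic compression, spelling correction, and text normalization. All of these methods are useful as preprocessing steps before search indexing, document classification, and text analysis.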

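Stemming is a technique for removing affixes from a word to obtain its stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines to produce index terms. Instead of storing every form of a word, a search engine can store only the stems, which greatly reduces the size of the index while improving retrieval accuracy.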
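The effect on index size can be sketched with a small inverted index. This is a rough illustration only: `toy_stem`, `build_index`, and the sample documents are hypothetical names and data introduced here, and `toy_stem` is a naive suffix stripper, not the Porter algorithm discussed in this chapter.

```python
from collections import defaultdict

# Hypothetical toy stemmer: strips a few common English suffixes.
# It is NOT the Porter algorithm; it only illustrates the idea.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


# Made-up sample documents, for illustration only.
docs = {
    1: "cooking dinner",
    2: "she cooked dinner",
    3: "he cooks",
}

raw_index = build_index(docs, str.lower)   # one entry per surface form
stem_index = build_index(docs, toy_stem)   # one entry per stem

print(len(raw_index), len(stem_index))
print(sorted(stem_index["cook"]))
```

In this sketch the three surface forms cooking, cooked, and cooks collapse into the single entry cook, so a query for any one of them can reach all three documents while the index holds fewer keys.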

One of the most common stemming algorithms is the Porter stemming algorithm, designed by Martin Porter. It removes and replaces well-known suffixes of English words. The next section looks at how to use it in NLTK.

Tip:

The resulting stem is not always a complete word. For example, the stem of cookery is cookeri. This is a feature, not a bug.

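NLTK ships with an implementation of the Porter stemming algorithm, and it is very easy to use. Simply instantiate the PorterStemmer class and call its stem() method with the word you want to stem.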

>>> from nltk.stem import PorterStemmer
>>> ...
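The listing above is cut off in this preview. As a stand-in, and not a reconstruction of the book's own listing, here is a minimal sketch of the usage just described. It assumes NLTK is installed, and the variable names `stemmer`, `words`, and `stems` are chosen here for illustration.

```python
from nltk.stem import PorterStemmer

# Instantiate the stemmer once and reuse it for every word.
stemmer = PorterStemmer()

# The two example words mentioned in the text above.
words = ["cooking", "cookery"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
# cooking -> cook
# cookery -> cookeri
```

The second result shows the point made in the tip: the stem cookeri is not an English word, but it is still a usable index key.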

Publisher Resources

ISBN: 9781835083451