Natural Language Processing with Python and NLTK (Python和NLTK实现自然语言处理)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing

Chapter 1 Tokenizing Text and WordNet Basics

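This chapter covers the following topics.

  • Tokenizing text into sentences.
  • Tokenizing sentences into words.
  • Tokenizing sentences using regular expressions.
  • Training a sentence tokenizer.
  • Filtering stopwords in a tokenized sentence.
  • Looking up Synsets for a word in WordNet.
  • Looking up lemmas and synonyms in WordNet.
  • Calculating WordNet Synset similarity.
  • Discovering word collocations.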

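The Natural Language ToolKit (NLTK) is a comprehensive Python library for natural language processing and text analysis. NLTK was originally designed for teaching, but its usefulness and breadth of coverage have led to wide adoption in industrial research and development. It is often used for rapid prototyping of text-processing programs and can even be used in production applications. For demos of selected NLTK functionality and production-ready APIs, see the text-processing website.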

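This chapter covers the basics of tokenizing text and using WordNet. Tokenization is a method of breaking a piece of text into smaller pieces, such as sentences and words. It is the first step for many of the methods in the chapters that follow. WordNet is a dictionary designed for programmatic access by natural language processing systems. Its use cases include the following.

  • Looking up the definition of a word.
  • Finding synonyms and antonyms.
  • Exploring relations and similarity between words.
  • Word sense disambiguation for words that have multiple uses and definitions.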

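NLTK includes a WordNet corpus reader, which we will use to access and explore WordNet. A corpus is simply a body of text, and corpus readers are designed to make accessing a corpus much easier than reading its files directly. We will use WordNet again in later chapters, so it is important to become familiar with the basics first.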

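Tokenization is the process of splitting a string into a list of tokens, or pieces. A token is a part of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph. We will start with sentence tokenization, that is, splitting a paragraph into a list of sentences. ...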
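To make the paragraph-to-sentence and sentence-to-word idea concrete, here is a minimal sketch that uses only Python's standard `re` module. It is not NLTK's implementation: the helper names `split_sentences` and `split_words` and the sample paragraph are hypothetical, chosen for illustration. NLTK itself offers `sent_tokenize` and `word_tokenize` in `nltk.tokenize` for this job, and those depend on downloaded tokenizer data, which this sketch avoids.

```python
import re


def split_sentences(paragraph):
    """Naive sentence splitter (illustrative only).

    Breaks after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Naive word splitter (illustrative only).

    A token is either a run of word characters, optionally with
    internal apostrophes (so "It's" stays whole), or one punctuation mark.
    """
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


# Hypothetical sample text.
paragraph = "Hello World. It's good to see you. Thanks for buying this book."

sentences = split_sentences(paragraph)        # paragraph -> sentences
tokens = [split_words(s) for s in sentences]  # each sentence -> words

for sentence, words in zip(sentences, tokens):
    print(sentence, "->", words)
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", where a period does not end the sentence. That limitation is one reason to prefer a trained sentence tokenizer over a hand-written pattern for real text.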

Publisher Resources

ISBN: 9781835083451