Natural Language Processing with Python and NLTK (Python和NLTK实现自然语言处理)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing
Content preview from Natural Language Processing with Python and NLTK

Chapter 1. Working with Strings

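Natural language processing (NLP) concerns the interaction between natural language and computers. It is a major component of artificial intelligence (AI) and computational linguistics. It enables seamless interaction between computers and humans and, with the help of machine learning, gives computers the ability to understand human speech. In many programming languages (for example, C, C++, Java, and Python), strings are the basic data type used to represent the contents of a file or document. In this chapter, we explore various string operations that are useful for carrying out NLP tasks.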

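This chapter covers the following topics.

  • Tokenizing text.
  • Normalizing text.
  • Substituting and correcting tokens.
  • Applying Zipf's law to text.
  • Applying similarity measures using the edit distance algorithm.
  • Applying similarity measures using the Jaccard coefficient.
  • Applying similarity measures using the Smith-Waterman algorithm.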

We define tokenization as the process of splitting text into smaller parts (tokens). It is regarded as a key step in natural language processing.
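To make the definition concrete, here is a minimal sketch that splits a string into word and punctuation tokens using only the standard library. The function name `simple_tokenize` and the regular expression are illustrative assumptions, not part of NLTK or of the book's code; NLTK's own tokenizers handle many more cases.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores (one word token);
    # [^\w\s] matches a single punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Welcome readers. I hope you find it interesting.")
print(tokens)
```

Each token is either a whole word or a single punctuation mark, which is the kind of "smaller part" the definition refers to.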

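Once NLTK is installed and Python IDLE is running, we can tokenize a text or paragraph into individual sentences. To do so, we import the sentence tokenization function, sent_tokenize, which takes the text to be tokenized as its argument. The sent_tokenize function uses an instance of NLTK's PunktSentenceTokenizer. This instance has already been trained to tokenize several European languages, based on the letters or punctuation marks that mark the beginning and end of sentences.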
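The idea of detecting sentence boundaries from punctuation and the letter that follows can be sketched with a plain regular expression. This is a deliberately naive stand-in, not how PunktSentenceTokenizer works internally; the name `naive_sent_split` and the rule "split after ., ! or ? when an uppercase letter follows" are assumptions made for illustration.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after ., ! or ? followed by whitespace and an uppercase letter."""
    # The lookbehind keeps the punctuation attached to its sentence;
    # the lookahead requires the next sentence to start with A-Z.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

text = " Welcome readers. I hope you find it interesting. Please do reply."
sentences = naive_sent_split(text)
print(sentences)

# A fixed rule like this breaks on abbreviations:
print(naive_sent_split("Dr. Smith arrived."))
```

The second call splits "Dr." off as its own sentence, which shows why a trained tokenizer is preferable to a hand-written rule for real text.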

Now let us see how a given text is tokenized into individual sentences.

>>> import nltk
>>> text=" Welcome readers. I hope you find it interesting. Please do reply."
>>> from nltk.tokenize import sent_tokenize
...

Publisher Resources

ISBN: 9781835083451