Python和NLTK实现自然语言处理

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing
Content preview from Python和NLTK实现自然语言处理

Chapter 8 Information Retrieval: Accessing Information

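Information retrieval is one of the applications of natural language processing. It can be defined as the process of retrieving information in response to a query made by a user (for example, the number of times the word Ganga appears in a document).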

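This chapter covers the following topics.

  • Information retrieval.
  • Stop word removal.
  • Information retrieval using a vector space model.
  • Vector space scoring and query operator interaction.
  • Developing a retrieval system using latent semantic indexing.
  • Text summarization.
  • Question-answering systems.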

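Information retrieval is defined as the process of responding to a user's query by retrieving the most suitable information. In information retrieval, the search is performed on the basis of metadata or context-based indexing. Google Search is one example: for each user query, it responds on the basis of the information retrieval algorithm it uses. Information retrieval algorithms rely on an indexing mechanism called an inverted index. An IR system builds an index of postings lists (index postlist) to carry out information retrieval tasks.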
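
To make the inverted index concrete, here is a minimal sketch that maps each term to the sorted list of document IDs containing it. The function name `build_inverted_index`, the whitespace tokenization, and the toy documents are illustrative assumptions, not code from the book.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to a sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; a real system would use a proper tokenizer.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection (assumption, not from the book).
docs = {
    1: "the Ganga flows through India",
    2: "the Yamuna joins the Ganga",
    3: "rivers of India",
}

index = build_inverted_index(docs)
print(index["ganga"])
print(index["india"])
```

Keeping each postings list sorted is a common design choice because it lets later query operations merge lists in a single linear pass.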

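Boolean retrieval is an information retrieval task in which Boolean operations are applied to postings lists to obtain the relevant information.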
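
As a rough illustration of Boolean operations over postings lists, the sketch below implements AND, OR, and NOT on sorted lists of document IDs. The helper names (`intersect`, `union`, `complement`) and the hand-written postings are assumptions made for this example.

```python
def intersect(p1, p2):
    """AND: merge two sorted postings lists, keeping IDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: IDs present in either postings list, in sorted order."""
    return sorted(set(p1) | set(p2))


def complement(p, all_ids):
    """NOT: IDs from the whole collection that are absent from the postings list."""
    excluded = set(p)
    return [doc_id for doc_id in all_ids if doc_id not in excluded]


# Hypothetical postings lists for a three-document collection (assumption).
postings = {"ganga": [1, 2], "india": [1, 3], "yamuna": [2]}
all_ids = [1, 2, 3]

ganga_and_india = intersect(postings["ganga"], postings["india"])
ganga_or_yamuna = union(postings["ganga"], postings["yamuna"])
ganga_not_yamuna = intersect(postings["ganga"], complement(postings["yamuna"], all_ids))

print(ganga_and_india, ganga_or_yamuna, ganga_not_yamuna)
```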

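The correctness of an information retrieval task is measured in terms of precision and recall.

Suppose that, when a user issues a query, a given IR system returns the set of documents X, while the actual (target) set of documents that should be returned is Y.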

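Recall R is defined as the fraction of the target documents that the system finds (that is, the ratio of true positives to the sum of true positives and false negatives).

$$R = \frac{|X \cap Y|}{|Y|}$$

Precision P is defined as the fraction of the documents returned by the IR system that are correct.

$$P = \frac{|X \cap Y|}{|X|}$$

The F-measure is defined as the harmonic mean of precision and recall.

$$F = \frac{2\,|X \cap Y|}{|X| + |Y|}$$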
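
The three measures can be checked numerically with a short sketch that takes the overlap between the returned set X and the target set Y as the numerator. The function name `precision_recall_f` and the example document IDs are hypothetical.

```python
def precision_recall_f(returned, target):
    """Compute precision, recall, and F-measure from returned and target document sets."""
    returned, target = set(returned), set(target)
    hits = len(returned & target)  # documents that are both returned and relevant
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(target) if target else 0.0
    total = len(returned) + len(target)
    f_measure = 2 * hits / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the system returns 4 documents, 3 documents are relevant.
X = {1, 2, 3, 4}
Y = {2, 3, 5}

p, r, f = precision_recall_f(X, Y)
print(p, r, f)
```

In this example two documents overlap, so precision is 2/4, recall is 2/3, and the F-measure is 4/7, which agrees with the harmonic mean 2PR/(P + R).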

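When performing information retrieval, it is important to detect and remove stop words from documents.

Consider the following NLTK code, which provides the collection of stop words that can be detected in English text: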

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', ...
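
Once a stop word list is available, removing stop words amounts to filtering tokens against it. The sketch below uses a small hand-written stop list so that it runs without downloading any corpus; the function name `remove_stopwords` and the sample list are assumptions for illustration.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercased form is not in the stop word list."""
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


# Small illustrative stop list (assumption); it is not the NLTK list.
sample_stop_words = ["the", "is", "a", "of", "in"]

tokens = "The Ganga is a river in India".split()
filtered = remove_stopwords(tokens, sample_stop_words)
print(filtered)
```

With the NLTK stopwords corpus installed, `stopwords.words('english')` could be passed as `stop_words` in place of the sample list.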

ISBN: 9781835083451