Python和NLTK实现自然语言处理 (Natural Language Processing with Python and NLTK)

by Posts & Telecom Press, Nitin Hardeniya
February 2024
Intermediate to advanced
649 pages
9h 58m
Chinese
Packt Publishing
Content preview from Python和NLTK实现自然语言处理

Chapter 7 Text Classification

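This chapter covers the following topics.

  • Bag-of-words feature extraction.
  • Training a naive Bayes classifier.
  • Training a decision tree classifier.
  • Training a maximum entropy classifier.
  • Training scikit-learn classifiers.
  • Measuring the precision and recall of a classifier.
  • Calculating high-information words.
  • Combining classifiers with voting.
  • Classifying with multiple binary classifiers.
  • Training a classifier with NLTK-Trainer.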

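Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, a classifier can decide which class label to assign to that text. A binary classifier decides between two labels, such as positive or negative. The text can carry one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text.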

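A classifier learns from labeled feature sets, also called training data, and then classifies unlabeled feature sets. Put simply, a labeled feature set is a tuple of the form (feat, label), while an unlabeled feature set is feat on its own. A feature set is basically a key-value mapping of feature names to feature values. In text classification, the feature names are usually words and the values are all True. Because documents may contain unknown words and the number of possible words can be very large, words that do not occur in the text are omitted rather than included in the feature set with the value False.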
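As a rough illustration of the (feat, label) structure described above, the following sketch builds one unlabeled and one labeled feature set as plain Python objects. The sample words, the `pos` label, and the `is_labeled` helper are assumptions made up for this example; none of them come from the book or from NLTK.

```python
# Hypothetical example data: the words and the 'pos' label are illustrative only.
feat = {"great": True, "movie": True}   # unlabeled feature set: feature name -> value
labeled = (feat, "pos")                 # labeled feature set: (feat, label)


def is_labeled(instance):
    """Return True if instance looks like a (feat, label) tuple (hypothetical helper)."""
    return (
        isinstance(instance, tuple)
        and len(instance) == 2
        and isinstance(instance[0], dict)
    )


# Words absent from the text are simply missing from the dict,
# rather than being stored with the value False.
print(is_labeled(labeled))   # True
print(is_labeled(feat))      # False
print("boring" in feat)      # False
```

Omitting absent words keeps each feature set as small as the text it came from, which matters when the vocabulary is large or open-ended.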

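An instance is another term for a feature set. It represents a single occurrence of a combination of features. The terms instance and feature set are used interchangeably here. A labeled feature set is an instance with a known class label that can be used for training or evaluation. In short, (feat, label) is a labeled feature set, or labeled instance, and feat is the feature set, usually represented as a key-value dictionary. When feat has no associated label, it is called an unlabeled feature set, or unlabeled instance.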

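Text feature extraction is essentially the process of turning a list of words into a feature set that a classifier can use. Because NLTK classifiers expect dict-style feature sets, the text must be transformed into a dict. The bag-of-words model is the simplest approach. It builds a word-presence feature set from all the words of an instance. This approach ignores word order and how many times a word occurs; all that matters is whether the word is present in the list of words. ...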
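A minimal sketch of the bag-of-words idea described above, in plain Python. The function name `bag_of_words` and the sample word list are assumptions for illustration; the book's own implementation is not shown in this preview and may differ.

```python
def bag_of_words(words):
    """Map each word in the list to True, ignoring order and repeat counts."""
    return {word: True for word in words}


# 'the' appears twice in the input but only once in the resulting feature set.
feats = bag_of_words(["the", "quick", "brown", "fox", "the"])
print(feats)
# {'the': True, 'quick': True, 'brown': True, 'fox': True}
```

The result is a dict of the kind the text says NLTK classifiers expect, so it could serve as the feat part of a (feat, label) pair.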



Publisher Resources

ISBN: 9781835083451