Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
... conceptually simpler methods also exist, such as non-negative matrix factorization (NMF) and singular value decomposition (SVD, sometimes called LSI).
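As a rough illustration of the NMF idea, the following sketch factors a tiny document-term count matrix into a document-topic matrix and a topic-term matrix using multiplicative updates. It is not the book's code: the matrix, the vocabulary, and all function names are invented for illustration, and a real project would normally rely on a library implementation instead of this pure-Python version.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def nmf(v, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Approximate v (docs x terms) as w (docs x topics) times h (topics x terms).

    Uses multiplicative updates for the squared reconstruction error;
    all entries of w and h stay non-negative.
    """
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h_ij * n / (d + eps) for h_ij, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w_ij * n / (d + eps) for w_ij, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
    return w, h


def reconstruction_error(v, w, h):
    """Sum of squared differences between v and the product w h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for vr, ar in zip(v, approx) for x, y in zip(vr, ar))


def dominant_topic(weights):
    """Index of the largest weight in one row of w."""
    return max(range(len(weights)), key=lambda t: weights[t])


# Toy document-term counts (invented): columns correspond to `terms`.
terms = ["peace", "security", "climate", "energy"]
V = [
    [4, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]

W, H = nmf(V, n_topics=2)
for t, row in enumerate(H):
    top = sorted(zip(row, terms), reverse=True)[:2]
    print(f"topic {t}:", [term for _, term in top])
print("dominant topic per document:", [dominant_topic(r) for r in W])
```

Each row of `H` can be read as a topic (weights over terms) and each row of `W` as a document's mix of topics. Which index a topic receives depends on the random initialization, so only the grouping of documents is meaningful, not the topic numbers.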
8.1 Chapter Overview
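In this chapter, we take an in-depth look at the various methods of topic modeling, examine their differences and similarities, and apply them to the same use case. You do not have to commit to a single method; depending on your requirements, you can compare the results of several.
After studying this chapter, you will know the main topic modeling methods and the strengths and weaknesses of each. You will learn how to use topic modeling to find topics and how to quickly create a summary of a document corpus. You will see how important it is to choose entities of the right granularity when computing topic models. Finally, to find the best topic model, you will experiment with many parameters and judge the quality of the resulting models with quantitative methods and numeric measures.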
8.2 Dataset: UN General Debates
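Our use case is a semantic analysis of the UN General Debates. An earlier chapter on text statistics already introduced this dataset.
This time we are more interested in the meaning and semantic content of the speeches, and in how the speeches can be organized by topic. We want to know what the speakers are talking about and to answer questions such as: Does the document corpus have structure? What are the topics? Which topic is most prominent? Does this change over time?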
8.2.1 Checking the Statistics of the Corpus
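Before starting with topic modeling, we first look at the statistics of the underlying text corpus. Depending on the results, you can choose different entities to analyze, such as documents, sections, or paragraphs of text.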
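To make the choice of analysis entity concrete, here is a minimal sketch that compares length statistics at document level and at paragraph level. The two sample texts are invented placeholders, not speeches from the dataset, and splitting paragraphs on blank lines is an assumption about how the text is formatted.

```python
import statistics

# Invented placeholder texts; they stand in for real corpus documents.
documents = [
    "First paragraph about peace.\n\nSecond paragraph about security and cooperation.",
    "A single paragraph about climate and energy.",
]


def split_paragraphs(text):
    """Split a text on blank lines and drop empty pieces (assumed paragraph format)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def length_stats(units):
    """Summarize unit lengths, measured in whitespace-separated tokens."""
    lengths = [len(u.split()) for u in units]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
    }


# Flatten all documents into one list of paragraphs.
paragraphs = [p for doc in documents for p in split_paragraphs(doc)]

doc_stats = length_stats(documents)
par_stats = length_stats(paragraphs)
print("documents: ", doc_stats)
print("paragraphs:", par_stats)
```

If the units at one level turn out to be very long or very uneven in length, that is a hint that a finer-grained entity such as the paragraph may be the better input for a topic model.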
We are not very interested in the authors or the additional information, so it is enough to work with just one of the CSV files: ...
Publisher Resources

ISBN: 9787519864446