Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
1.10 Summary
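In this chapter, we covered the first steps of analyzing text data. Text preparation and tokenization should be kept simple so that you get results quickly. Chapter 4 introduces more sophisticated methods and discusses the advantages and disadvantages of the different approaches.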
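Data exploration not only gives you a first understanding of the data, it also helps you build confidence in it. One thing to keep in mind: whenever a token shows up unexpectedly, get to the bottom of it and find the root cause. KWIC (keyword-in-context) analysis is a good tool for searching for such tokens.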
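A KWIC lookup can be sketched in a few lines of plain Python. This is only a minimal illustration, not the chapter's own implementation: the function name `kwic`, the `window` parameter, and the two sample sentences are invented for this example.

```python
import re

def kwic(texts, keyword, window=30):
    """Return (left, match, right) context triples for every keyword hit."""
    # Whole-word, case-insensitive match; the keyword is treated literally.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for text in texts:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - window):m.start()]
            right = text[m.end():m.end() + window]
            hits.append((left, m.group(0), right))
    return hits

# Invented sample documents, for illustration only.
docs = [
    "The delegates applauded. The SDGs were adopted in 2015.",
    "Progress on the sdgs remains uneven across regions.",
]

for left, word, right in kwic(docs, "sdgs", window=20):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} [{word}] {right}")
```

Seeing the surrounding words usually makes it clear whether an odd token is an abbreviation, a tokenization artifact, or noise in the source data.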
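For content analysis, we introduced several blueprints for word frequency analysis. Terms can be weighted either by term frequency alone or by the combination of term frequency and inverse document frequency (TF-IDF). We return to these concepts in detail in Chapter 5, because TF-IDF weighting is a standard way to vectorize documents for machine learning.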
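The two weighting schemes can be contrasted with a small sketch. It assumes documents are already tokenized and uses one common IDF variant, log(N / df); other variants add smoothing or a constant, and the chapter's own formula may differ. The function name `tfidf` and the sample tokens are invented for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in docs for term in set(tokens))
    # Unsmoothed IDF: terms that occur in every document get weight 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in docs
    ]

# Invented sample data, for illustration only.
sample = [["climate", "change", "climate"], ["trade", "change"]]
weights = tfidf(sample)
print(weights[0])
```

With pure term frequency, "change" would count as much as "trade" in the second document. With TF-IDF it drops to zero because it appears in every document, which is the effect that makes the weighting useful for telling documents apart.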
There are many aspects of text analysis that we could not cover in this chapter, including:
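• Author-related information can help you identify influential writers, if that is one of your project goals. You can distinguish authors by activity, social scores, writing style, and so on.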
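• Sometimes it is useful to compare authors, or different corpora on the same topic, by readability. textacy (https://oreil.ly/FRZJb) provides a function called textstats that computes readability scores and other statistics in a single pass through the text.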
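• Jason Kessler's Scattertext (https://oreil.ly/R6Aw8) library is an interesting tool for identifying and visualizing terms that differ between categories (for example, political parties).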
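• Besides plain Python, you can also use interactive visual tools for data analysis. Microsoft's PowerBI has a nice word cloud add-on and many other options for producing interactive charts. ...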
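To make the readability comparison mentioned in the list above more concrete, here is a rough sketch of one well-known score, the Automated Readability Index (ARI). It is hand-rolled with regular expressions and is not textacy's implementation. The function name, the simplistic word and sentence splitting, and the sample texts are assumptions made for this example.

```python
import re

def automated_readability_index(text):
    """Approximate ARI: 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    # Naive tokenization: runs of letters, digits, or apostrophes are words.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Naive sentence splitting on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # ARI counts letters and digits only, so apostrophes are excluded here.
    chars = sum(ch.isalnum() for w in words for ch in w)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

# Invented sample texts, for illustration only.
simple_text = "The cat sat. The dog ran."
dense_text = "International organizations coordinate multilateral negotiations."

print(automated_readability_index(simple_text))
print(automated_readability_index(dense_text))
```

Higher values indicate text that is harder to read. For real projects a library function is preferable, because sentence and word segmentation are considerably more subtle than these regular expressions suggest.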

Publisher Resources

ISBN: 9787519864446