Text Mining with R: A Tidy Approach (文本挖掘:基于R 语言的整洁工具)

by Julia Silge, David Robinson
March 2018
Intermediate to advanced
170 pages
3h 48m
Chinese
China Machine Press
Chapter 3. Analyzing Word and Document Frequency: tf-idf
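A central question in text mining and natural language processing is how to quantify what a document is about. Can we do this by looking at the words that make up the document? As mentioned in Chapter 1, one measure of how important a word may be is its term frequency (tf): how often the word occurs in a document. Some words, however, occur many times in a document without being important; in English these are words such as "this", "is", and "of". We could add such words to a stop-word list and remove them before analysis, but some of them may matter more in certain documents than in others. A stop-word list is therefore a fairly unsophisticated way to adjust term frequency for commonly used words.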
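The term-frequency idea and the stop-word adjustment described above can be sketched in a few lines. The book itself works in R with tidy tools; the following is only a minimal plain-Python illustration, and the sample sentence, the tiny stop-word set, and the function name `term_frequency` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy document; not taken from the book.
document = "this is a story of a whale and this is a story of the sea"

# Illustrative stop-word list (an assumption, far smaller than real lists).
STOP_WORDS = {"this", "is", "of", "a", "the", "and"}


def term_frequency(text, stop_words=frozenset()):
    """Return {word: count / number of words kept} for one document."""
    # Lowercase, split on whitespace, and drop any stop words.
    words = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


tf_all = term_frequency(document)                    # no filtering
tf_filtered = term_frequency(document, STOP_WORDS)   # stop words removed
```

Without filtering, the most frequent term in this toy sentence is "a"; with the stop-word list applied, only "story", "whale", and "sea" remain. The filter is all-or-nothing, which is why it cannot account for a common word that happens to matter in one particular document.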
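Another approach is to look at a term's inverse document frequency (idf), which decreases the weight of words that are common across a collection of documents and increases the weight of words that are rare. Inverse document frequency can be combined with term frequency to give tf-idf (the two quantities multiplied together): the frequency of a term adjusted for how rarely it is used, so that a word that occurs often in one document but seldom in the others receives a high score.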
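The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.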
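The statistic tf-idf is a rule-of-thumb, or heuristic, quantity. It has proved useful in text mining, search engines, and similar applications, but information theory experts consider its theoretical foundations less than firm. The inverse document frequency of any given term is defined as: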
$$
\mathrm{idf}(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)
$$
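To make the definition concrete, here is a minimal sketch that computes idf with the natural logarithm as defined above and multiplies it by term frequency to obtain tf-idf. It is written in plain Python for illustration only and is not the tidy R workflow the book goes on to use; the three-document corpus and the function names are hypothetical, and term frequency is assumed to be a word's count divided by the number of words in its document.

```python
import math
from collections import Counter

# Hypothetical three-document corpus; the texts are made up for illustration.
corpus = {
    "doc1": "the whale swims in the sea",
    "doc2": "the captain hunts the whale",
    "doc3": "the garden is in bloom",
}


def inverse_document_frequency(docs):
    """idf(term) = ln(n_documents / n_documents containing term)."""
    n_documents = len(docs)
    # Count each term once per document it appears in.
    containing = Counter()
    for text in docs.values():
        containing.update(set(text.split()))
    return {term: math.log(n_documents / n) for term, n in containing.items()}


def tf_idf(docs):
    """Return {doc: {term: tf * idf}}, with tf = count / words in document."""
    idf = inverse_document_frequency(docs)
    scores = {}
    for name, text in docs.items():
        words = text.split()
        counts = Counter(words)
        scores[name] = {t: (n / len(words)) * idf[t] for t, n in counts.items()}
    return scores


idf = inverse_document_frequency(corpus)
scores = tf_idf(corpus)
```

In this toy corpus "the" appears in every document, so its idf is ln(3/3) = 0 and its tf-idf is 0 everywhere, however often it occurs. A word such as "garden", found in only one of the three documents, gets the largest idf, ln(3).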
We can use the tidy data principles described in Chapter 1 to approach tf-idf analysis, and use consistent ...

Publisher Resources

ISBN: 9787111588559