精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Chapter 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
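The bag-of-words representation is simple to generate, but it is far from perfect. If we count all words equally, some words end up being emphasized far more than they need to be. Recall the example of Emma and the raven from Chapter 3. We would like a document representation that emphasizes the two main characters. The words "Emma" and "raven" each appear 3 times, but "the" appears a whopping 8 times, "and" appears 5 times, and "it" and "was" each appear 4 times. Simple frequency counts alone cannot make the main characters stand out, and that is the weakness of this approach.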
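To make the problem concrete, here is a minimal sketch of raw bag-of-words counting. The sentence is a made-up toy example, not the Emma and raven passage from Chapter 3, and whitespace tokenization is an assumption chosen for brevity.

```python
from collections import Counter

# Hypothetical toy sentence (NOT the passage from Chapter 3).
text = "the raven looked at the girl and the girl looked at the raven"

# Naive tokenization: lowercase and split on whitespace.
tokens = text.lower().split()

# Bag-of-words: count how often each word occurs.
counts = Counter(tokens)

# Print words from most to least frequent.
for word, count in counts.most_common():
    print(word, count)
```

In this toy sentence the function word "the" is counted more often than either "raven" or "girl", which mirrors the imbalance described above.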
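It would also be nice to pick out words such as "magnificently," "gleamed," "intimidated," "tentatively," and "reigned," because they help set the overall tone of the passage. They convey sentiment, which can be very valuable information to a data scientist. So, ideally, we would like a representation that highlights meaningful words.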
4.1 Tf-Idf: A Simple Extension of Bag-of-Words
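Tf-idf is a simple extension of the bag-of-words approach. It stands for term frequency-inverse document frequency. Instead of the raw count of each word in each document of the dataset, tf-idf computes a normalized count, in which each word count is divided by the number of documents that word appears in. That is: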
bow(w, d) = number of times word w appears in document d
tf-idf(w, d) = bow(w, d) * N / (number of documents in which word w appears)
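The two definitions above can be sketched directly in Python. This is an illustrative sketch, not code from the book: the function names `bow` and `tf_idf`, the toy corpus, and the whitespace tokenization are all assumptions, and N is taken to be the total number of documents in the corpus. Returning 0.0 for a word that appears in no document is also a choice made here to avoid dividing by zero.

```python
from collections import Counter

def bow(word, doc_tokens):
    """Number of times `word` appears in one tokenized document."""
    return Counter(doc_tokens)[word]

def tf_idf(word, doc_tokens, corpus):
    """bow(w, d) * N / (number of documents in which w appears)."""
    n_docs = len(corpus)  # N: total number of documents
    doc_freq = sum(1 for tokens in corpus if word in tokens)
    if doc_freq == 0:
        # Assumption: a word absent from the whole corpus scores 0.
        return 0.0
    return bow(word, doc_tokens) * n_docs / doc_freq

# Hypothetical toy corpus of four tokenized documents.
corpus = [
    "the raven sat on the fence".split(),
    "the girl saw the raven".split(),
    "the girl went home".split(),
    "the sun set".split(),
]

doc = corpus[0]
for word in ("the", "raven", "fence"):
    print(word, bow(word, doc), tf_idf(word, doc, corpus))
```

In the first document "the" has a raw count of 2 and "fence" a raw count of 1, yet "fence" receives the higher tf-idf score because it appears in only one of the four documents. Library implementations such as scikit-learn's `TfidfTransformer` use a log-scaled, smoothed variant of this formula, so their numbers will differ from this sketch.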
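N is the total number of documents in the dataset. The fraction N / (number of documents in which word w appears) is what is known as the inverse document frequency. If a word appears in many documents, then its inverse document frequency is close to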
Publisher Resources

ISBN: 9787115509680