精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
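On the other hand, tf-idf can produce a scaling factor that is close to 0, as shown in Figure 4-2. This happens when a word appears in a large number of documents in the training set; such a word is unlikely to be strongly correlated with the target vector. Removing it lets the solver focus on the other directions in the column space and find a better solution (although the gain in accuracy will probably be small, because this approach usually finds few noisy directions to prune).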
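The near-zero scaling factor can be seen in a minimal sketch. It assumes the plain `log(N / df)` definition of inverse document frequency; libraries often use smoothed variants that never reach exactly 0. The toy corpus and the names `idf`, `docs`, and `idf_by_word` are hypothetical, chosen only for illustration.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency in the plain log(N / df) form (an assumption)."""
    return math.log(num_docs / doc_freq)

# Hypothetical toy corpus: four already-tokenized documents.
docs = [
    ["the", "pizza", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "pizza", "was", "cold"],
    ["the", "staff", "was", "rude"],
]

vocab = sorted({word for doc in docs for word in doc})

# Document frequency: the number of documents that contain each word.
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}

# The idf value is the factor by which tf-idf multiplies a word's raw count.
idf_by_word = {word: idf(len(docs), doc_freq[word]) for word in vocab}

for word in ("the", "pizza", "great"):
    print(f"{word:6s} df={doc_freq[word]} idf={idf_by_word[word]:.3f}")
```

A word found in every document ("the") gets a factor of 0, so its column is effectively removed, while a word found in a single document ("great") gets the largest factor.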
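Where feature scaling (both ℓ2 normalization and tf-idf) really makes a difference is in the convergence speed of the solver. This shows up as a markedly smaller condition number for the data matrix (the ratio of the largest singular value to the smallest; see Appendix A for a detailed discussion of these terms). In fact, ℓ2 normalization brings the condition number to almost 1. But a smaller condition number does not guarantee a better solution. In this experiment, ℓ2 normalization converged much faster than either bag-of-words or tf-idf, but it was also more prone to overfitting: it needed more regularization and was more sensitive to the number of iterations during optimization.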
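As a rough illustration of the condition number, the sketch below computes the ratio of the largest to the smallest singular value for a tiny two-column matrix, before and after scaling each column to unit ℓ2 norm. Several things here are assumptions: normalization is applied per column (feature), the matrix `X` is a made-up toy, and its columns are deliberately orthogonal so that the normalized result is exactly 1. Real data will only get closer to 1, not reach it.

```python
import math

def l2_normalize_columns(X):
    """Divide every column by its l2 norm (per-feature scaling, an assumption)."""
    n_cols = len(X[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in X]

def condition_number_two_columns(X):
    """Largest / smallest singular value of an n x 2 matrix.

    The squared singular values are the eigenvalues of the 2 x 2 Gram
    matrix [[a, b], [b, c]] = X^T X, which have a closed form.
    """
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    c = sum(row[1] * row[1] for row in X)
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return math.sqrt((mean + radius) / (mean - radius))

# Hypothetical toy matrix: two orthogonal columns on very different scales.
X = [[3.0, 0.0],
     [4.0, 0.0],
     [0.0, 0.1]]

cond_raw = condition_number_two_columns(X)
cond_normalized = condition_number_two_columns(l2_normalize_columns(X))

print(f"raw: {cond_raw:.2f}, column-normalized: {cond_normalized:.2f}")
```

The closed-form eigenvalue formula keeps the sketch dependency-free; for matrices with more columns, a general SVD routine would be the usual choice.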
4.4 Summary
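In this chapter, we used tf-idf as an entry point for a detailed analysis of how feature transformations affect a model. Tf-idf is a special case of feature scaling, so we compared its effect with that of another feature scaling method, ℓ2 normalization.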
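The results were less than satisfying. Compared with a plain bag-of-words representation, neither tf-idf nor ℓ2 normalization improved the accuracy of the final classifier. After some statistical modeling and linear algebra analysis, we understood why: neither of them changes the column space of the data matrix.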
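One way to see what an unchanged column space implies is the sketch below. It swaps in unregularized linear least squares, which is not the classifier used in the chapter's experiments, because its fitted values depend only on the column space. Multiplying each column by a nonzero constant should therefore leave the fitted values unchanged. The data, the scale factors, and the helper names are all hypothetical.

```python
def least_squares_fit(X, y):
    """Fitted values X @ w for an n x 2 matrix, via the 2 x 2 normal equations."""
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    c = sum(row[1] * row[1] for row in X)
    p = sum(row[0] * target for row, target in zip(X, y))
    q = sum(row[1] * target for row, target in zip(X, y))
    det = a * c - b * b
    # Solve [[a, b], [b, c]] @ w = [p, q] with the explicit 2 x 2 inverse.
    w0 = (c * p - b * q) / det
    w1 = (a * q - b * p) / det
    return [row[0] * w0 + row[1] * w1 for row in X]

def scale_columns(X, factors):
    """Multiply column j by factors[j]; this is what feature scaling does."""
    return [[value * f for value, f in zip(row, factors)] for row in X]

# Hypothetical toy data: four samples, two features, binary targets.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [2.0, 0.0]]
y = [1.0, 0.0, 1.0, 0.0]

fit_raw = least_squares_fit(X, y)
fit_scaled = least_squares_fit(scale_columns(X, [0.25, 7.0]), y)

# Largest difference between the two sets of fitted values.
max_gap = max(abs(u - v) for u, v in zip(fit_raw, fit_scaled))
print(f"largest difference in fitted values: {max_gap:.2e}")
```

The weights themselves change (each is divided by its column's scale factor), but the predictions do not. With regularization or early stopping this exact invariance no longer holds, which is consistent with scaling mattering for convergence and overfitting rather than for the best achievable fit.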
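There is one small difference between the two: tf-idf can both "stretch" a word count and "compress" it. In other words, it makes some counts larger and pushes others close to 0. As a result, tf-idf can eliminate uninformative words fairly thoroughly.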
Along the way, we also discovered another effect of feature scaling: it reduces the condition number of the data matrix ...
Publisher Resources

ISBN: 9787115509680