精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Dimensionality Reduction: Squashing Data with PCA
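Another way to choose k involves the intrinsic dimension of the data set. This is a rather hazy concept, but it can be determined from the spectrum of the data matrix. Put simply, if the spectrum contains a few very large singular values and a number of very small ones, we can keep the very large ones and discard the rest. Sometimes the remaining singular values are not that small, but there is a large gap between the head and the tail of the spectrum, and that gap is also a reasonable cutoff. This method requires visual inspection of the spectrum, so it cannot run as part of an automated pipeline.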
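To make the idea of a gap in the spectrum concrete, here is a minimal sketch in Python with NumPy. It is not the authors' procedure: the function name `spectrum_and_gap`, the synthetic data, and the "largest ratio between consecutive singular values" rule are all assumptions made for illustration. In line with the text, the result is meant as a candidate for a person to confirm by looking at the spectrum, not as an automatic choice of k.

```python
import numpy as np

def spectrum_and_gap(X):
    """Return the singular values of centered X and a candidate k.

    Hypothetical helper: the candidate k is the number of singular values
    that sit before the largest ratio between consecutive values.
    """
    Xc = X - X.mean(axis=0)                      # center each feature (column)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    # Ignore values that are numerically zero so they do not create a fake gap.
    tol = s[0] * max(Xc.shape) * np.finfo(float).eps
    s_pos = s[s > tol]
    if s_pos.size < 2:
        return s, int(s_pos.size)
    ratios = s_pos[:-1] / s_pos[1:]              # head/tail ratio at each cut
    return s, int(np.argmax(ratios)) + 1

# Synthetic data (an assumption): 3 strong latent directions in 10 features,
# plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
X = latent + 0.01 * rng.normal(size=(200, 10))

singular_values, k_candidate = spectrum_and_gap(X)
print(np.round(singular_values, 3))
print("candidate k:", k_candidate)
```

On real data the gap is rarely this clean, which is why the printed spectrum matters more than the suggested number.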
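A key criticism of PCA is that the transformation is fairly complex, and the results are therefore hard to interpret. The principal components and the projected vectors are real-valued and can be positive or negative. The principal components are essentially linear combinations of the (centered) rows, and the projected values are linear combinations of the columns. In a stock returns application, for example, each factor is a linear combination of time slices of stock returns. What does that mean? It is hard to give a human-understandable reason for the learned factors, so analysts find it hard to trust the results. If you cannot explain why you should put the money of thousands upon thousands of other people into a particular stock, you probably will not use the model.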
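The statement about linear combinations can be checked numerically. The following is a small sketch assuming PCA is computed from the SVD of the centered data matrix; the random data and the variable names are illustrative only and do not come from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))           # 50 data points (rows), 4 features (columns)
Xc = X - X.mean(axis=0)                # centered data matrix

# Thin SVD: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
components = Vt[:k]                    # principal components, shape (k, d)
Z = Xc @ components.T                  # projected values, shape (n, k)

# First projected coordinate: a weighted sum of the columns of Xc,
# with the entries of the first component as weights.
z0_from_columns = sum(components[0, j] * Xc[:, j] for j in range(Xc.shape[1]))

# First principal component: a weighted sum of the centered rows,
# with weights U[:, 0] / s[0].
v0_from_rows = (Xc.T @ U[:, 0]) / s[0]

# The weights are real numbers of mixed sign, which is part of what makes
# the components hard to read as "more of this feature, less of that one".
has_mixed_signs = bool(components.min() < 0 < components.max())

print(np.round(components, 3))
print("mixed signs:", has_mixed_signs)
```

Nothing in the weights says which combination of features is meaningful to a person, which is the interpretability problem described above.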
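PCA is computationally expensive. It relies on SVD, which is itself a demanding procedure. Computing the full SVD of a matrix takes $O(nd^2 + d^3)$ operations (Golub and Van Loan, 2012), assuming $n \ge d$, that is, at least as many data points as features. Even if we only need $k$ principal components, computing the truncated SVD (the $k$ largest singular values and their corresponding singular vectors) still takes $O((n+d)^2 k) = O(n^2 k)$ operations. With large numbers of data points and features, this cost is prohibitive.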
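To get a feel for the scale, the two expressions above can be evaluated for concrete sizes. This is only a rough sketch: big-O expressions hide constant factors, so the numbers indicate orders of magnitude rather than real operation counts, and the function names and the sizes chosen are assumptions for illustration.

```python
def full_svd_ops(n, d):
    """Evaluate n*d^2 + d^3, the full-SVD expression from the text (constants ignored)."""
    return n * d**2 + d**3

def truncated_svd_ops(n, d, k):
    """Evaluate (n + d)^2 * k, the truncated-SVD expression from the text (constants ignored)."""
    return (n + d) ** 2 * k

# Hypothetical sizes: 10,000 data points, 1,000 features, 10 components.
n, d, k = 10_000, 1_000, 10

full_count = full_svd_ops(n, d)                   # 11_000_000_000
truncated_count = truncated_svd_ops(n, d, k)      # 1_210_000_000
doubled_count = truncated_svd_ops(2 * n, d, k)    # same d and k, twice the data

print(f"full SVD:       {full_count:,}")
print(f"truncated SVD:  {truncated_count:,}")
print(f"truncated, 2n:  {doubled_count:,}")
```

Because the truncated expression grows with the square of n + d, doubling the number of data points more than triples the estimate in this example.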
With streaming data, batch updates, or a sample of the full data, it is difficult to perform ...
Publisher Resources

ISBN: 9787115509680