精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Nonlinear Featurization and k-Means Model Stacking
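... compressing high-dimensional data into low-dimensional data, which is commonly used for visualization in two- or three-dimensional space.

Figure 7-1. The Swiss roll, a nonlinear manifold

However, reducing feature dimensionality as far as possible is only a small part of what feature engineering aims for. Its fundamental goal is still to find the right features for the task at hand. In this chapter, the right features are those that capture the spatial characteristics of the data.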
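Clustering algorithms are not usually used as local structure learning techniques, but they are in fact fully capable of the job. Points that are close to each other (where "close" can be defined by a chosen metric) belong to the same cluster. Given a clustering, a data point can be represented by its cluster membership vector. If the number of clusters is smaller than the original number of features, then the new representation has fewer dimensions than the original one, and the original data is compressed into a lower-dimensional space. This chapter explains this idea.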
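The cluster membership vector can be illustrated with a minimal sketch. It assumes the membership vector is a one-hot encoding of the nearest cluster center and that "close" means Euclidean distance. The centers, the points, and the helper names `nearest_cluster` and `membership_vector` are hypothetical and chosen only for illustration.

```python
import math

def nearest_cluster(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

def membership_vector(point, centers):
    """One-hot vector with a 1 at the index of the point's cluster."""
    vec = [0] * len(centers)
    vec[nearest_cluster(point, centers)] = 1
    return vec

# Hypothetical data: 5-dimensional points and 3 assumed cluster centers.
centers = [
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (5.0, 5.0, 5.0, 5.0, 5.0),
    (10.0, 0.0, 10.0, 0.0, 10.0),
]
points = [
    (0.2, -0.1, 0.3, 0.0, 0.1),
    (4.8, 5.1, 5.3, 4.9, 5.0),
    (9.7, 0.4, 10.2, -0.3, 9.9),
]

# Each 5-dimensional point becomes a 3-dimensional membership vector.
features = [membership_vector(p, centers) for p in points]
print(features)
```

Because there are three clusters and five original features, each point ends up with a shorter representation, which is the compression described above.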
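Compared with nonlinear embedding techniques, clustering produces more features. But if the end goal is feature engineering rather than visualization, this is not a problem.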
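We will illustrate the idea of local structure learning with a common clustering algorithm called k-means, which is simple and easy to apply. Rather than saying that k-means performs nonlinear manifold dimensionality reduction, it is more accurate to say that it performs nonlinear manifold feature extraction. Used correctly, k-means clustering can be a powerful tool for feature engineering.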
7.1 k-Means Clustering
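k-means is a clustering algorithm. Clustering algorithms group data according to how it is distributed in space. Clustering is an unsupervised learning method: it requires no labels of any kind, and the algorithm's purpose is to infer cluster labels based solely on the structure of the data itself.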
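To make "inferring cluster labels from the data alone" concrete, here is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm), which alternates between assigning points to the nearest center and moving each center to the mean of its points. This excerpt does not spell out the procedure, so treat it as an assumption. The function name `kmeans`, the toy points, and the naive choice of the first k points as initial centers are hypothetical. Practical implementations usually use random or k-means++ initialization.

```python
import math

def kmeans(points, k, n_iter=20):
    """Minimal k-means sketch: returns (centers, labels)."""
    # Naive initialization (an assumption made for determinism):
    # use the first k points as the starting centers.
    centers = [tuple(p) for p in points[:k]]

    def assign(current_centers):
        # Assignment step: label each point with its nearest center.
        return [
            min(range(k), key=lambda j: math.dist(p, current_centers[j]))
            for p in points
        ]

    for _ in range(n_iter):
        labels = assign(centers)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # centers stopped moving: converged
        centers = new_centers

    # Final labels are computed against the final centers.
    return centers, assign(centers)

# Hypothetical toy data: two well-separated groups; no labels are supplied.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Even though both initial centers start inside the first group, the alternating steps pull one center over to the second group, so the two groups end up with different labels.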
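Clustering algorithms rely on a metric, that is, a measurement of how close data points are to each other. The most commonly used metric is the Euclidean distance, also called the Euclidean metric. It comes from Euclidean geometry and measures the straight-line distance between two points. This metric feels entirely natural to us, because it is the distance we encounter everywhere in the real world. ...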
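As a small worked example of the straight-line distance, the sketch below computes the Euclidean distance between two points as the square root of the sum of squared coordinate differences. The function name `euclidean_distance` and the sample points are hypothetical. Python's standard library offers the same computation as `math.dist`.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)
```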
Publisher Resources

ISBN: 9787115509680