Practical Statistics for Data Scientists, 2nd Edition (数据科学中的实用统计学(第2版))

by Peter Bruce, Andrew Bruce, Peter Gedeck
October 2021
Intermediate to advanced
289 pages
8h 31m
Chinese
Posts & Telecom Press
Content preview from Practical Statistics for Data Scientists, 2nd Edition
Unsupervised Learning
... scale to big data. Finally, the algorithm is very complex and harder to master than other methods.
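Key Ideas
Clusters are assumed to come from multiple data-generating processes, each with a different probability distribution.
Several models are fit, each assuming a different number of distributions (typically normal).
The method chooses the model (and its associated number of clusters) that fits the data well without using too many parameters (that is, without overfitting).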
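The selection idea in the key points above can be sketched in a few lines. This is a minimal illustration, not the implementation used by mclust or GaussianMixture: it fits one-dimensional Gaussian mixtures with a hand-written EM loop and compares them with the Bayesian information criterion (BIC), which is assumed here as the penalty for extra parameters. The function names, the synthetic data, and the variance floor are all hypothetical choices made for the example.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_log_likelihood(xs, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by EM and return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Start the means at evenly spaced quantiles, with a shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k

    def component_densities(x):
        return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = component_densities(x)
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-3)  # floor keeps a component from collapsing

    return sum(math.log(sum(component_densities(x))) for x in xs)


def bic(xs, k):
    """BIC = p * ln(n) - 2 * log-likelihood; lower is better."""
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    return n_params * math.log(len(xs)) - 2 * gmm_log_likelihood(xs, k)


# Deterministic synthetic data: evenly spaced quantiles of N(0, 1) and N(10, 1).
data = [NormalDist(mu, 1.0).inv_cdf((i + 0.5) / 100)
        for mu in (0.0, 10.0) for i in range(100)]

scores = {k: bic(data, k) for k in range(1, 5)}
best_k = min(scores, key=scores.get)
print(best_k)
```

Each extra component adds three parameters, so it must improve the log-likelihood enough to offset the larger penalty term. In practice one would rely on a library rather than this loop; scikit-learn's GaussianMixture, for example, exposes a bic method for the same comparison.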
7.4.4 Further Reading
For more detail on model-based clustering, see the mclust and GaussianMixture documentation.
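7.5 Scaling and Categorical Variables
Unsupervised learning techniques generally require that the data be appropriately scaled. This differs from most regression and classification techniques, in which scaling is not important (KNN is an exception; see Section 6.1).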
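Key Terms
Scaling
Squashing or expanding data, usually to bring multiple variables to the same order of magnitude.
Normalization
One method of scaling: subtract the mean, then divide by the standard deviation.
Synonym: Standardization
Gower's distance
A scaling algorithm applied to mixed numeric and categorical data that brings all variables into the range between 0 and 1.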
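To make the last term concrete, here is a minimal sketch of one common formulation of Gower's distance, assuming numeric variables are compared by absolute difference divided by the variable's range and categorical variables by simple matching (0 if equal, 1 otherwise). The records, field names, and helper functions are hypothetical and are not taken from the book's loan data.

```python
def numeric_ranges(records, numeric_names):
    """Range (max - min) of each numeric variable across all records."""
    return {
        name: max(r[name] for r in records) - min(r[name] for r in records)
        for name in numeric_names
    }


def gower_distance(a, b, ranges):
    """Average per-variable dissimilarity between two records; each term lies in [0, 1]."""
    parts = []
    for name in a:
        if name in ranges:
            # Numeric variable: absolute difference scaled by the observed range.
            span = ranges[name]
            parts.append(abs(a[name] - b[name]) / span if span else 0.0)
        else:
            # Categorical variable: 0 if the categories match, 1 otherwise.
            parts.append(0.0 if a[name] == b[name] else 1.0)
    return sum(parts) / len(parts)


# Hypothetical records mixing numeric and categorical variables.
records = [
    {"loan_amount": 5000, "years_employed": 2, "home": "RENT"},
    {"loan_amount": 25000, "years_employed": 10, "home": "OWN"},
    {"loan_amount": 15000, "years_employed": 2, "home": "RENT"},
]
ranges = numeric_ranges(records, ["loan_amount", "years_employed"])

d01 = gower_distance(records[0], records[1], ranges)
d02 = gower_distance(records[0], records[2], ranges)
print(d01, d02)
```

Because every variable contributes a value between 0 and 1, the dollar amount cannot dominate the years employed or the home-ownership category, whatever units they are recorded in.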
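For example, in the personal loan data, the variables differ widely in units and magnitude. Some variables have relatively small values (such as years employed), while others have very large values (such as the loan amount in dollars). If the data is not scaled, then PCA, K-means, and other clustering methods will be dominated by the variables with large values and will ignore the variables with small values.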
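The dominance effect can be checked numerically. The sketch below uses made-up values for five borrowers (not the book's loan data) and measures what fraction of the squared Euclidean distance between two of them comes from years employed, first on the raw values and then after standardizing each variable. The population standard deviation is used for the scaling; that choice is an assumption, and the sample standard deviation would serve equally well.

```python
from statistics import mean, pstdev

# Hypothetical borrowers: years employed and loan amount in dollars.
years = [1, 10, 2, 5, 7]
amounts = [10000, 10500, 10800, 9000, 12000]


def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def years_share(ys, amts, i, j):
    """Fraction of the squared Euclidean distance between records i and j due to years."""
    dy = (ys[i] - ys[j]) ** 2
    da = (amts[i] - amts[j]) ** 2
    return dy / (dy + da)


raw_share = years_share(years, amounts, 0, 1)
scaled_share = years_share(standardize(years), standardize(amounts), 0, 1)
print(f"raw: {raw_share:.4f}, standardized: {scaled_share:.4f}")
```

On the raw values, a nine-year difference in employment accounts for well under 1% of the squared distance between the first two borrowers, because the 500-dollar difference swamps it. After standardization the same nine-year difference accounts for most of the distance.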
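Categorical data can pose a special problem for some clustering procedures. As in KNN, unordered factor variables are generally converted to a set of binary (0/1) variables using one hot encoding (see Section 6.1.3). The binary variables are likely to be on a different scale from the other data, and because they take only two values, in PCA ...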
Publisher Resources

ISBN: 9787115569028