面向数据科学家的实用统计学

by Peter Bruce, Andrew Bruce
October 2018
Beginner to intermediate
238 pages
6h 32m
Chinese
Posts & Telecom Press
Content preview from 面向数据科学家的实用统计学
Unsupervised Learning
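The flexibility of hierarchical clustering comes at a cost: it does not scale well to large data sets with millions of records. Even for a moderately sized data set with only tens of thousands of records, hierarchical clustering can require substantial computing resources. In practice, most applications of hierarchical clustering focus on relatively small data sets.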
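One way to see the scaling problem is to count the pairwise distances that a distance-based method has to hold. The following is a rough back-of-the-envelope sketch, not a benchmark; the helper names `n_pairs` and `dist_gb` are made up for illustration, and the estimate assumes each distance is stored as an 8-byte double.

```r
# Number of pairwise distances among n records: n * (n - 1) / 2
n_pairs <- function(n) n * (n - 1) / 2

# Approximate storage for those distances as 8-byte doubles, in GB (10^9 bytes)
dist_gb <- function(n) n_pairs(n) * 8 / 1e9

n_pairs(1e4)   # 49,995,000 pairs for ten thousand records
dist_gb(1e4)   # roughly 0.4 GB
dist_gb(1e6)   # roughly 4,000 GB for one million records
```

The quadratic growth in the number of pairs is what makes data sets with millions of records impractical here.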
7.3.1 A Simple Example
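We apply hierarchical clustering to a data set with $n$ records and $p$ variables. It relies on two basic measures.
Distance metric $d_{i,j}$: measures the distance between two records $i$ and $j$.
Dissimilarity metric $D_{A,B}$: measures the difference between two clusters $A$ and $B$, based on the distances $d_{i,j}$ between the members of each cluster.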
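To make the two measures concrete, here is a minimal sketch on a hypothetical four-record data set. The data, the helper `complete_linkage`, and the choice of Euclidean distance with complete linkage (the largest distance between a member of $A$ and a member of $B$) are assumptions made for illustration; the text above does not fix a particular definition of $D_{A,B}$.

```r
# Toy data (hypothetical): 4 records, 2 variables
x <- matrix(c(0, 0,
              1, 0,
              5, 5,
              6, 5),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3", "r4"), c("v1", "v2")))

# Distance metric d_{i,j}: Euclidean distance between every pair of records
d <- as.matrix(dist(x, method = "euclidean"))

# Dissimilarity D_{A,B}, assumed here to be complete linkage:
# the largest distance between any member of A and any member of B
complete_linkage <- function(d, A, B) max(d[A, B])

A <- c("r1", "r2")
B <- c("r3", "r4")

d["r1", "r2"]               # distance between records r1 and r2
complete_linkage(d, A, B)   # equals the r1-r4 distance, sqrt(61)
```

Swapping `max` for `min` or `mean` would give single or average linkage instead, which changes which clusters look "close".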
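For applications that use numeric data, the key choice is the dissimilarity metric. Hierarchical clustering starts by treating each record as its own cluster, then iteratively merges the pair of clusters with the lowest dissimilarity.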
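The merge loop just described can be sketched in a few lines. This is a deliberately naive illustration, not how `hclust` is implemented: the function `agglomerate` and its `linkage` argument are hypothetical, and the default `max` corresponds to complete linkage.

```r
# Naive agglomerative clustering; d is a full symmetric distance matrix
agglomerate <- function(d, linkage = max) {
  clusters <- as.list(seq_len(nrow(d)))   # each record starts as its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    best <- c(NA, NA)
    best_D <- Inf
    # Find the pair of clusters with the lowest dissimilarity
    for (a in seq_len(length(clusters) - 1)) {
      for (b in (a + 1):length(clusters)) {
        D_ab <- linkage(d[clusters[[a]], clusters[[b]]])
        if (D_ab < best_D) {
          best_D <- D_ab
          best <- c(a, b)
        }
      }
    }
    # Merge the pair and record the dissimilarity at which it happened
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    merges[[length(merges) + 1]] <- list(members = merged, height = best_D)
    clusters[[best[1]]] <- merged
    clusters[[best[2]]] <- NULL
  }
  merges
}

# Hypothetical toy data: two tight pairs of points
x <- matrix(c(0, 0, 1, 0, 5, 5, 6, 5), ncol = 2, byrow = TRUE)
merges <- agglomerate(as.matrix(dist(x)))
heights <- sapply(merges, function(m) m$height)
heights
```

On this toy input the two tight pairs merge first, and the final merge joins them at the complete-linkage distance between the pairs.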
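In R, the hclust function performs hierarchical clustering. One major difference between hclust and kmeans is that hclust operates not on the data itself but on the pairwise distances $d_{i,j}$ between records. These distances can be computed for all pairs of records with the dist function. For example, the following code applies hierarchical clustering to the stock returns of a set of companies.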
syms1 <- c('GOOGL', 'AMZN', 'AAPL', 'MSFT', 'CSCO', 'INTC', 'CVX',
           'XOM', 'SLB', 'COP', 'JPM', 'WFC', 'USB', 'AXP',
           'WMT', 'TGT', 'HD', 'COST')
# Take the transpose: to cluster companies, the stocks must be along the rows
df <- t(sp500_px[row.names(sp500_px)>='2011-01-01', ...
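The listing above is cut off in this preview, so the following is a self-contained sketch of the same dist-then-hclust workflow on a small hypothetical matrix. It stands in for the transposed returns matrix and is not the book's data or its remaining code; the row labels and values are invented for illustration.

```r
# Hypothetical toy matrix: rows are the items to cluster, columns are variables
x <- matrix(c(0.0, 0.1,
              0.2, 0.0,
              5.0, 5.1,
              5.2, 4.9,
              9.9, 0.1),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c", "d", "e"), NULL))

d  <- dist(x)      # pairwise Euclidean distances between rows
hc <- hclust(d)    # agglomerative clustering; method defaults to "complete"

hc$merge           # which records or clusters were joined at each step
hc$height          # dissimilarity at which each merge happened

groups <- cutree(hc, k = 3)   # cut the tree into 3 clusters
groups
```

Calling `plot(hc)` on the result draws the dendrogram, which is usually the quickest way to inspect the merge order.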

Publisher Resources

ISBN: 9787115493668