精通数据科学算法

by Posts & Telecom Press, David Natingga
May 2024
Intermediate to advanced
181 pages
3h 9m
Chinese
Packt Publishing

Chapter 5 k-means Clustering

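Cluster analysis is a technique for dividing data into groups (clusters) such that the items in the same group (cluster) have features that are similar in some sense.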

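This chapter covers the following topics:

  • applying the k-means clustering algorithm to a household income example;
  • performing classification by first clustering the features together with the features of known classes, using gender classification as the example;
  • implementing the k-means clustering algorithm in Python, detailed in Section 5.3;
  • a house ownership case study, including how to choose an appropriate number of clusters;
  • using document clustering as an example to understand how the number of clusters changes the meaning of the boundaries between clusters.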

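Consider households with annual incomes of $40,000, $55,000, $70,000, $100,000, $115,000, $130,000, and $135,000. Treat income as the measure of (within-cluster) similarity. If the households are divided into two groups, the first group contains the households earning $40,000, $55,000, and $70,000; the second group contains those earning $100,000, $115,000, $130,000, and $135,000.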

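The reasoning is as follows. $40,000 and $135,000 are the two incomes farthest apart, and since there must be two groups, they must fall in different groups. $55,000 is closer to $40,000 than to $135,000, so $40,000 and $55,000 belong to the same group. Likewise, $130,000 and $135,000 belong to the same group. $70,000 is closer to $40,000 and $55,000 than to $130,000 and $135,000, so it joins the group of $40,000 and $55,000. $115,000 is closer to $130,000 and $135,000 than to the first group's $40,000, $55,000, and $70,000, so it goes into the second group. Finally, $100,000 is closer to the second group's $115,000, $130,000, and $135,000, so it goes into that group as well. The first group therefore contains the households with annual incomes of $40,000, $55,000, and $70,000, and the second group contains the households with annual incomes of $100,000, $115,000, $130,000, and $135,000.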
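The grouping argued above can be checked with a minimal sketch of k-means on one-dimensional data. This is not the book's Section 5.3 implementation; the function name `kmeans_1d` is hypothetical, and seeding the two centroids at the farthest-apart incomes (40 and 135, in thousands of dollars) is an assumption chosen to mirror the reasoning in the text.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Illustrative k-means on scalars; returns (clusters, centroids)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return clusters, centroids


# Annual household incomes in thousands of dollars, from the text.
incomes = [40, 55, 70, 100, 115, 130, 135]

# Assumption: seed with the two incomes that are farthest apart.
clusters, centroids = kmeans_1d(incomes, [min(incomes), max(incomes)])
print(clusters)   # [[40, 55, 70], [100, 115, 130, 135]]
print(centroids)  # [55.0, 120.0]
```

With these seeds the assignment stabilizes after one update, and the two clusters match the low-income and high-income groups described in the text. Other seeds may need more iterations to reach a stable assignment.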

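Clustering is a form of classification: it gathers features with similar attribute values and assigns them to the same cluster. The data scientist then has to interpret the clustering result and the classification it induces. The households with annual incomes of $40,000, $55,000, and $70,000 represent the low-income household class; the households with annual incomes of $100,000, ...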

Publisher Resources

ISBN: 9781836204596