
精通数据科学算法 (Mastering Data Science Algorithms)

Name: 精通数据科学算法
ISBN: 9781836204596

by Posts & Telecom Press, David Natingga

May 2024

Intermediate to advanced

181 pages

3h 9m

Chinese

Packt Publishing


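Copyright Information
Copyright Notice
Synopsis
About the Author
Acknowledgments
About the Reviewer
Preface
Resources and Support
Chapter 1: Classification Using k-Nearest Neighbors
1.1 Mary and her temperature preferences
1.2 Implementing the k-nearest neighbors algorithm
1.3 Italian regions example: choosing the value of k
1.4 House ownership: data transformation
1.5 Text classification: using non-Euclidean distances
1.6 Text classification: k-NN in higher dimensions
1.7 Summary
1.8 Exercises
Chapter 2: Naive Bayes
2.1 Medical test: basic application of Bayes' theorem
2.2 Proof of Bayes' theorem and its extension
2.3 Playing chess: independent events
2.4 Implementing a naive Bayes classifier
2.5 Playing chess: dependent events
2.6 Gender classification: Bayes' theorem for continuous random variables
2.7 Summary
2.8 Exercises
Chapter 3: Decision Trees
3.1 Swimming preference: representing data with a decision tree
3.2 Information theory
3.3 ID3 algorithm: constructing a decision tree
3.4 Classifying with a decision tree
3.5 Summary
3.6 Exercises
Chapter 4: Random Forests
4.1 Overview of the random forest algorithm
4.2 Swimming preference: analysis with a random forest
4.3 Implementing the random forest algorithm
4.4 Playing chess example
4.5 Shopping analysis: overcoming inconsistency in random data and measuring the level of confidence
4.6 Summary
4.7 Exercises
Chapter 5: k-means Clustering
5.1 Household incomes: clustering into k clusters
5.2 Gender classification: clustering to classify
5.3 Implementing the k-means clustering algorithm
5.4 House ownership example: choosing the number of clusters
5.5 Summary
5.6 Exercises
Chapter 6: Regression Analysis
6.1 Fahrenheit and Celsius conversion: linear regression on perfect data
6.2 Predicting weight from height: linear regression on real-world data
6.3 Gradient descent algorithm and its implementation
6.4 Predicting flight duration from distance
6.5 Ballistic flight analysis: non-linear model
6.6 Summary
6.7 Exercises
Chapter 7: Time Series Analysis
7.1 Business profits: trend analysis
7.2 Electronics shop sales: seasonality analysis
7.3 Summary
7.4 Exercises
Appendix A: Statistics
A.1 Basic concepts
A.2 Bayesian inference
A.3 Distributions
A.4 Cross-validation
A.5 A/B testing
Appendix B: R Reference
B.1 Introduction
B.2 Data types
B.3 Linear regression
Appendix C: Python Reference
C.1 Introduction
C.2 Data types
C.3 Control flow
Appendix D: Glossary of Algorithms and Methods in Data Science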

Content preview from 精通数据科学算法

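Chapter 5: k-means Clustering

Cluster analysis is a technique for dividing data into groups (clusters) such that the data within one group (cluster) are similar in some sense.

This chapter covers the following:

Applying the k-means clustering algorithm to a household income example;
Classifying items by first clustering their features together with data whose classes are already known, using gender classification as the example;
Implementing the k-means clustering algorithm in Python, as detailed in Section 5.3;
A house ownership example, with an analysis of how to choose an appropriate number of clusters;
A document clustering example, showing how the number of clusters changes the meaning of the boundaries between clusters.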

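5.1 Household incomes: clustering into k clusters

Consider households with annual incomes of USD 40k, 55k, 70k, 100k, 115k, 130k, and 135k, and treat income as the measure of similarity within a cluster. If the households are divided into two groups, the first group contains the households earning 40k, 55k, and 70k; the second group contains those earning 100k, 115k, 130k, and 135k.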

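The reasoning is as follows. 40k and 135k are the farthest apart, and since two groups are required, they must fall into different groups. 55k is closer to 40k than to 135k, so 40k and 55k belong to the same group. Likewise, 130k and 135k belong to the same group. 70k is closer to 40k and 55k than to 130k and 135k, so 70k joins the group of 40k and 55k. 115k is closer to 130k and 135k than to the first group's 40k, 55k, and 70k, so it goes into the second group. Finally, 100k is closer to the second group's 115k, 130k, and 135k, so it also goes into that group. The first group therefore contains the households with annual incomes of 40k, 55k, and 70k, and the second group contains those with annual incomes of 100k, 115k, 130k, and 135k.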
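The grouping argued above can be reproduced with a minimal one-dimensional k-means sketch. This is an illustration under stated assumptions, not the implementation the book presents in Section 5.3: the function name `kmeans_1d` is hypothetical, incomes are expressed in thousands of USD, and the two initial centroids are placed at the smallest and largest incomes, mirroring the observation that 40k and 135k are farthest apart.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Hypothetical helper: plain k-means on 1-D data.

    Returns the final centroids and the clusters (one list per centroid).
    """
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Annual household incomes in thousands of USD (from the example above).
incomes = [40, 55, 70, 100, 115, 130, 135]

# Assumption: seed the centroids with the two incomes that are farthest apart.
centroids, clusters = kmeans_1d(incomes, centroids=[min(incomes), max(incomes)])
print(clusters)   # [[40, 55, 70], [100, 115, 130, 135]]
print(centroids)  # [55.0, 120.0]
```

With this seeding the algorithm converges after two passes, and the resulting clusters match the two groups derived by hand. A different initialization could in general lead k-means to a different partition, which is why the starting centroids are stated explicitly here.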

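Clustering is a form of classification that groups data points with similar attribute values together and assigns them to one cluster. The data scientist needs to interpret the result of the clustering and the classification it induces. The households with annual incomes of USD 40k, 55k, and 70k represent the low-income class; annual incomes of USD 100k, ...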
