精通数据科学算法

by Posts & Telecom Press, David Natingga
May 2024
Intermediate to advanced
181 pages
3h 9m
Chinese
Packt Publishing
Content preview from 精通数据科学算法

Chapter 3: Decision Trees

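A decision tree is an arrangement of data in a tree structure: at each node, the data is split into different branches according to the value of the attribute tested at that node.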

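This chapter builds a decision tree with the standard ID3 learning algorithm, which selects the attribute that classifies the data samples so as to maximize the information gain, a measure based on information entropy.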
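For reference, the standard information-theoretic definitions behind this criterion can be written as below. The symbols are notation chosen here rather than taken from the book: S is a set of samples, p_i is the fraction of S belonging to class i, A is an attribute, and S_v is the subset of S in which A takes the value v.

```latex
H(S) = -\sum_{i} p_i \log_2 p_i
\qquad
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

Under these definitions, ID3 tests at each node the attribute A with the largest IG(S, A) over the samples S that reach that node.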

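This chapter covers the following:

  • What a decision tree is, and how to represent the data from the "swim preference" example as a decision tree;
  • The information-theoretic concepts of information entropy and information gain, first in theory and then applied to the "swim preference" example;
  • How to implement the ID3 algorithm in Python and construct a decision tree from training data;
  • How to classify new data items with the decision tree built in the "swim preference" example;
  • How to use a decision tree as an alternative to the analysis of the chess-playing example in Chapter 2, and how the results of the two algorithms differ;
  • When a decision tree is an appropriate method of analysis;
  • How to handle inconsistent data while building a decision tree, as in the "shopping" example.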

For example, people may have certain preferences about when to swim. These preferences are recorded in Table 3-1:

Table 3-1

Swimming suit    Water temperature    Swim preference
None             Cold                 No
None             Warm                 No
Small            Cold                 No
Small            Warm                 No
Good             Cold                 No
Good             Warm                 Yes
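As a minimal sketch of how information gain could be computed on these six rows, the Python below encodes the table and compares the two attributes. The names `ROWS`, `entropy`, and `information_gain` are illustrative choices made here, not the book's implementation.

```python
from collections import Counter
from math import log2

# Rows of Table 3-1: (swimming suit, water temperature, swim preference).
ROWS = [
    ("None", "Cold", "No"),
    ("None", "Warm", "No"),
    ("Small", "Cold", "No"),
    ("Small", "Warm", "No"),
    ("Good", "Cold", "No"),
    ("Good", "Warm", "Yes"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute_index):
    """Entropy of the class column minus the weighted entropy that remains
    after partitioning the rows by the attribute at attribute_index."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute_index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


gain_suit = information_gain(ROWS, 0)
gain_temperature = information_gain(ROWS, 1)
print(f"swimming suit: {gain_suit:.3f}, water temperature: {gain_temperature:.3f}")
# prints: swimming suit: 0.317, water temperature: 0.191
```

On this data the swimming suit attribute yields the larger gain, so a criterion that maximizes information gain would test it first.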

The data in Table 3-1 can be represented by the decision tree shown in Figure 3-1.

Figure 3-1

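The root node asks a question: does the person have a swimming suit? The answer splits the available data into three groups of two rows each. If the attribute swimming suit is None, the attribute swim preference is No. There is no need to ask about the water temperature, because every sample whose swimming suit is None is classified as No. The same holds when swimming suit is Small. When swimming suit is Good, the remaining two rows fall into two classes: No and Yes. ...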
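The walk from the root to a leaf described above can be sketched in Python as follows. The nested-dict encoding and the names `TREE` and `classify` are hypothetical. Because the figure is not reproduced here, the sketch also assumes that the Good branch tests water temperature, the only other attribute in Table 3-1.

```python
# The tree as nested dicts: an inner node maps the attribute it tests to
# its branches; a leaf is the class label as a plain string.
# Assumption: the "Good" branch splits on water temperature.
TREE = {
    "swimming_suit": {
        "None": "No",
        "Small": "No",
        "Good": {"water_temperature": {"Cold": "No", "Warm": "Yes"}},
    }
}


def classify(tree, item):
    """Follow, at each node, the branch matching the item's value for the
    tested attribute until a leaf (a class label) is reached."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[item[attribute]]
    return node


print(classify(TREE, {"swimming_suit": "Good", "water_temperature": "Warm"}))  # Yes
print(classify(TREE, {"swimming_suit": "None", "water_temperature": "Warm"}))  # No
```

For the None and Small branches the water temperature is never consulted, which matches the observation that those groups are already uniformly No.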

ISBN: 9781836204596