精通数据科学算法

by Posts & Telecom Press, David Natingga
May 2024
Intermediate to advanced
181 pages
3h 9m
Chinese
Packt Publishing

Chapter 4 Random Forest

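A random forest is a collection of decision trees (decision trees are described in Chapter 3), each built from a randomly drawn subset of the data. The forest classifies a data item by majority vote: the item is assigned to the class that receives the most votes from the individual trees. Because a random forest reduces both bias and variance, it tends to classify data items more accurately than a single decision tree.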
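The voting step can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `majority_vote` and the class labels are assumptions made for the example, and each tree is assumed to have already produced one predicted label.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the class label predicted by the most trees.

    tree_votes: a list with one predicted class label per tree.
    Ties go to the label that was encountered first in the list.
    """
    if not tree_votes:
        raise ValueError("at least one vote is required")
    # most_common(1) returns [(label, count)] for the most frequent label.
    label, _count = Counter(tree_votes).most_common(1)[0]
    return label


# Hypothetical predictions from three trees for a single data item.
votes = ["swim", "no swim", "swim"]
print(majority_vote(votes))
```

With an even number of trees a tie is possible, which is one reason small demonstration forests often use an odd tree count.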

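This chapter covers:

  • Bagging (bootstrap aggregating), the sampling technique used to build a random forest, which also extends to other algorithms and methods in data science as a way to reduce bias and variance and so improve prediction accuracy;
  • Building a random forest for the swim preference example and using it to classify sample data;
  • Implementing the random forest algorithm in Python;
  • How naive Bayes, decision trees, and random forests differ when analyzing the chess-playing example;
  • How, in the shopping example, a random forest overcomes the shortcomings of a decision tree and why it outperforms the decision tree algorithm;
  • End-of-chapter exercises showing how reducing a classifier's variance can yield more accurate results.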

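As a rule, the number of decision trees to build is chosen up front. A random forest does not tend to overfit (unless the data is very noisy), so building many trees does not reduce prediction accuracy. However, more trees require more computing power, and increasing the number of trees dramatically does not improve classification accuracy much. What matters is that there are enough trees for most of the training data to be drawn into at least one tree's random sample and so take part in the classification.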
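To get a feel for how many trees are "enough" for most of the data to be used, the following sketch estimates the expected share of training items that land in at least one tree's sample. It assumes that every tree draws N items with replacement from N training items; the function name `expected_coverage` is made up for this illustration.

```python
def expected_coverage(n_items, n_trees):
    """Expected fraction of training items drawn by at least one tree.

    Assumes each tree samples n_items items with replacement
    from a training set of n_items items.
    """
    if n_items < 1 or n_trees < 0:
        raise ValueError("n_items must be >= 1 and n_trees must be >= 0")
    # Chance that one particular item is never drawn in a single tree's sample.
    p_missed_by_one_tree = (1 - 1 / n_items) ** n_items
    # The trees sample independently, so the miss probabilities multiply.
    return 1 - p_missed_by_one_tree ** n_trees


for n_trees in (1, 2, 5, 10):
    print(n_trees, round(expected_coverage(100, n_trees), 4))
```

Under this assumption a single tree sees roughly 63% of the distinct training items, and the uncovered share shrinks geometrically with each additional tree, so even a modest forest touches nearly all of the data.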

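In practice, we can run the algorithm with a specific number of decision trees, then keep increasing that number and compare the classification results of the smaller and larger forests. Once the results are very similar, there is no reason to add more trees.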
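One way to automate this comparison is sketched below. Everything named here is hypothetical: `predict_with(n_trees)` stands for any caller-supplied function that builds a forest with the given number of trees and returns its predictions on a fixed evaluation set, and the `threshold`, `start`, `step`, and `max_trees` values are arbitrary choices rather than recommendations from the book.

```python
def agreement(predictions_a, predictions_b):
    """Fraction of positions at which two prediction lists agree."""
    if len(predictions_a) != len(predictions_b):
        raise ValueError("prediction lists must have the same length")
    if not predictions_a:
        return 1.0
    matches = sum(a == b for a, b in zip(predictions_a, predictions_b))
    return matches / len(predictions_a)


def choose_tree_count(predict_with, start=10, step=10,
                      threshold=0.99, max_trees=200):
    """Grow the forest until a larger forest barely changes the predictions.

    predict_with(n_trees) is a hypothetical callable: it must build a
    forest with n_trees trees and return a list of predicted labels.
    """
    n_trees = start
    previous = predict_with(n_trees)
    while n_trees + step <= max_trees:
        current = predict_with(n_trees + step)
        if agreement(previous, current) >= threshold:
            return n_trees  # the extra trees changed almost nothing
        n_trees += step
        previous = current
    return n_trees  # reached the cap without the results stabilizing


def demo_predict(n_trees):
    """Stand-in for a real forest: predictions settle at 30 trees."""
    correct = min(n_trees // 10, 10) if n_trees < 30 else 10
    return ["a"] * correct + ["b"] * (10 - correct)


print(choose_tree_count(demo_predict))
```

Because forests are built from random samples, two runs with the same tree count can also disagree slightly, so a threshold a little below 1.0 is usually more practical than demanding identical predictions.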

To keep the demonstrations simple, this book uses random forests that contain only a small number of decision trees.

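This section describes how each tree is built by random sampling. Specifically, given N training data items, the data for each decision tree is obtained by drawing N items at random, with replacement, from the initial data. This process of randomly selecting the data for each tree is called bagging (tree bagging, or bootstrap aggregating). Drawing the training data this way reduces the variance and bias of the classification results. ...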
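The sampling step described above can be sketched with Python's standard library. This is an illustrative sketch only: the function names, the fixed seed, and the sample rows are assumptions for the example and are not taken from the book's code or data.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    if not data:
        raise ValueError("data must not be empty")
    return rng.choices(data, k=len(data))


def bagged_samples(data, n_trees, seed=0):
    """Return one bootstrap sample per tree, reproducible for a given seed."""
    rng = random.Random(seed)
    return [bootstrap_sample(data, rng) for _ in range(n_trees)]


# Hypothetical training rows: (swimming suit, water temperature, label).
rows = [
    ("none", "cold", "no"),
    ("small", "cold", "no"),
    ("good", "cold", "no"),
    ("good", "warm", "yes"),
]

for i, sample in enumerate(bagged_samples(rows, n_trees=3)):
    print(f"tree {i}: {sample}")
```

Because the draw is with replacement, a sample typically repeats some rows and omits others, which is what makes the individual trees differ from one another.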

ISBN: 9781836204596