R大数据分析实用指南

by Posts & Telecom Press, Simon Walkowiak
May 2024
Intermediate to advanced
387 pages
6h 29m
Chinese
Packt Publishing
Content preview from R大数据分析实用指南

Chapter 8: Machine Learning for Big Data in R

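So far in this book, we have explored a range of descriptive and diagnostic statistical methods that can be readily applied to data sources with large memory footprints. The real potential of modern data science, however, lies in its predictive and prescriptive capabilities. To exploit them, a well-rounded data scientist should understand the techniques and methods behind machine learning algorithms, along with their logic and implementation. In this chapter, we use R syntax to introduce machine learning methods suited to big data classification and clustering problems. The chapter will also help you build the following skills.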

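  • Understand the concepts of machine learning, and distinguish supervised from unsupervised methods and clustering from classification models.
  • Run high-performance Generalized Linear Models on a multi-node Spark HDInsight cluster by calling the Spark MLlib module through the SparkR package.
  • Apply the Naive Bayes classification algorithm and design a Deep Learning neural network to predict event classes in real-world data, using H2O, an open-source distributed machine learning platform for big data that connects to R through the H2O package.
  • Learn to evaluate the performance and accuracy metrics of selected machine learning algorithms.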
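The distinction drawn in the first skill above can be illustrated with a minimal sketch. It is written in plain Python rather than the R used in the book, and the toy data, the labels, and the function names `train_nearest_centroid`, `classify`, and `kmeans_1d` are all assumptions made for illustration. The supervised method learns from known labels; the unsupervised method groups the same points without ever seeing them.

```python
# Minimal sketch contrasting supervised classification with unsupervised
# clustering on the same 1-D toy data. Data and names are illustrative.

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # used only by the supervised method


def train_nearest_centroid(xs, ys):
    """Supervised: compute one centroid per labelled class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def classify(model, x):
    """Assign x to the class whose centroid is nearest."""
    return min(model, key=lambda label: abs(x - model[label]))


def kmeans_1d(xs, centroids, iterations=10):
    """Unsupervised: k-means groups the points without using any labels."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in xs:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


model = train_nearest_centroid(points, labels)
predicted = classify(model, 4.2)  # a new, unlabelled observation

centroids, clusters = kmeans_1d(points, centroids=[0.0, 6.0])

print("predicted class:", predicted)
print("cluster centroids:", centroids)
print("clusters:", clusters)
```

The starting centroids passed to `kmeans_1d` are fixed here so the run is deterministic; practical implementations usually choose them at random and repeat the procedure several times.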

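We begin with a brief introduction to machine learning concepts, covering the most commonly used predictive algorithms and classification models and their typical characteristics. We also list resources where you can find more detail on selected algorithms, and we guide you through the growing set of big data machine learning tools available to data scientists.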

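Machine learning methods encapsulate data mining and statistical techniques that allow researchers to make sense of data, model the relationships between variables or features, and extend those models to predict the values or classes of future events. How, then, does this differ from well-known statistical testing? Generally speaking, machine learning methods are less strict about the format and characteristics of the data; that is, when predicting a continuous response variable, many machine learning algorithms do not require its residuals to be normally distributed. Most statistical tests focus more on inference and hypothesis testing, particularly when computing a general statistic (for example, the F statistic in ANOVA or regression), whereas machine learning models try to use the observed patterns to explain and predict future data. In fact, the two concepts overlap considerably, and many techniques can be classed as both machine learning and statistical testing. As we will see later, they also use similar diagnostic measures, such as the mean squared error or R², to assess the identified patterns and how well the models generalize. ...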
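As a rough illustration of the two diagnostic measures named above, the following sketch fits a straight line by ordinary least squares on a small training set and then scores it on held-out observations. It uses plain Python rather than the book's R, and the data values and the function names `fit_line`, `mse`, and `r_squared` are assumptions chosen for the example. The measures follow the standard definitions: the mean squared error is the average squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Minimal sketch: fit y = intercept + slope * x by ordinary least squares,
# then score the fit on held-out data with MSE and R^2.
# All data values and function names are illustrative assumptions.


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(actual, predicted):
    """Mean squared error: the average squared residual."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, computed on the data passed in."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Training data used to estimate the model.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Held-out data used only to judge how well the model generalizes.
test_x = [6.0, 7.0, 8.0]
test_y = [12.2, 13.8, 16.1]

intercept, slope = fit_line(train_x, train_y)
test_pred = [intercept + slope * x for x in test_x]
test_mse = mse(test_y, test_pred)
test_r2 = r_squared(test_y, test_pred)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"held-out MSE={test_mse:.4f} R^2={test_r2:.4f}")
```

Scoring on observations the model never saw during fitting is what makes these measures an indication of generalization rather than of fit to the training data alone.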


Publisher Resources

ISBN: 9781836205791