Spark Machine Learning in Practice (Spark机器学习实战)

by Posts & Telecom Press, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
May 2024
Beginner to intermediate
549 pages
8h 11m
Chinese
Packt Publishing

Chapter 3 The Three Musketeers of Spark Machine Learning

This chapter covers the following topics:

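  • Creating RDDs from internal data sources in Spark 2.0;
  • Creating RDDs from external data sources in Spark 2.0;
  • Transforming RDDs with the Spark 2.0 filter() API;
  • Transforming RDDs with the very useful flatMap() API;
  • Transforming RDDs with the set operation APIs;
  • Transforming and aggregating RDDs with groupBy() and reduceByKey();
  • Transforming RDDs with the zip() API;
  • Join transformations with paired key-value RDDs;
  • Reduce and grouping transformations with paired key-value RDDs;
  • Creating DataFrames from Scala data structures;
  • Creating DataFrames without using SQL;
  • Loading and setting up DataFrames from external sources;
  • Creating DataFrames with standard SQL (that is, SparkSQL);
  • Working with the Dataset API using Scala sequences;
  • Creating and using Datasets from RDDs, and converting back again;
  • Working with JSON using the Dataset API together with SQL;
  • Functional programming with the Dataset API using domain objects.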
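
Several of the RDD topics listed above can be previewed in one short sketch. This is a minimal illustration rather than the book's own code: it assumes Spark 2.x running with a local master, and the object name `RddPreview`, its method names, and the sample data are all hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical preview object; names and sample data are illustrative only.
object RddPreview {

  // filter(): build an RDD from a Scala collection and keep the even numbers.
  def evens(sc: SparkContext, xs: Seq[Int]): Seq[Int] =
    sc.parallelize(xs).filter(_ % 2 == 0).collect().toSeq

  // flatMap() + reduceByKey(): split lines into words, then count each word.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // Set operation: elements present in both RDDs (sorted for a stable result).
  def common(sc: SparkContext, a: Seq[Int], b: Seq[Int]): Seq[Int] =
    sc.parallelize(a).intersection(sc.parallelize(b)).collect().sorted.toSeq

  // zip(): pair each element with its square. Both sides derive from the same
  // RDD, so they share the partitioning that zip() requires.
  def withSquares(sc: SparkContext, xs: Seq[Int]): Seq[(Int, Int)] = {
    val rdd = sc.parallelize(xs)
    rdd.zip(rdd.map(x => x * x)).collect().toSeq
  }

  // join(): combine two paired (key, value) RDDs on their keys.
  def joinScores(sc: SparkContext,
                 names: Seq[(Int, String)],
                 scores: Seq[(Int, Int)]): Seq[(Int, (String, Int))] =
    sc.parallelize(names).join(sc.parallelize(scores)).collect().sortBy(_._1).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; the master URL and app name are placeholders.
    val spark = SparkSession.builder().master("local[*]").appName("RddPreview").getOrCreate()
    val sc = spark.sparkContext

    println(evens(sc, 1 to 10))
    println(wordCounts(sc, Seq("spark makes rdds", "rdds are lazy")))
    println(common(sc, Seq(1, 2, 3, 4), Seq(3, 4, 5, 6)))
    println(withSquares(sc, Seq(1, 2, 3)))
    println(joinScores(sc, Seq((1, "alice"), (2, "bob")), Seq((1, 90), (2, 75))))

    spark.stop()
  }
}
```

Each helper calls `collect()` so the result can be inspected on the driver. That is reasonable for tiny samples like these, but on large data the transformations would normally stay distributed.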

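The three main tools Spark offers for processing large-scale data efficiently are the RDD, DataFrame, and Dataset APIs. Although each API has its own strengths, the new paradigm shift favors the Dataset as the unified data API, meeting all data-processing needs through a single interface.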

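The new Spark 2.0 Dataset API is a type-safe collection of domain objects that can be transformed in parallel using either functional or relational operations (similar to filter, map, and flatMap() on RDDs). For backward compatibility, the Dataset has a view called DataFrame, which is an untyped collection of rows. In this chapter, we demonstrate all three API sets. Figure 3-1 summarizes the strengths and weaknesses of the key Spark components for data processing.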
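
To make the typed/untyped distinction concrete, the sketch below runs the same query twice: once in functional style on a `Dataset` of domain objects, and once in relational style on its `DataFrame` view. It is an illustration under assumptions, not the book's code: the `Car` case class, the `DatasetPreview` object, and the sample data are hypothetical, and a local Spark 2.x session is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain object, defined at top level so Spark can derive an encoder.
case class Car(make: String, year: Int)

object DatasetPreview {

  // Functional style: typed lambdas over Car objects, checked at compile time.
  def recentMakesTyped(spark: SparkSession, cars: Seq[Car], since: Int): Seq[String] = {
    import spark.implicits._
    val ds: Dataset[Car] = cars.toDS()
    ds.filter(_.year >= since).map(_.make).collect().sorted.toSeq
  }

  // Relational style: the DataFrame view is a Dataset of untyped Row objects,
  // so columns are referenced by name and resolved at run time.
  def recentMakesUntyped(spark: SparkSession, cars: Seq[Car], since: Int): Seq[String] = {
    import spark.implicits._
    val df: DataFrame = cars.toDS().toDF()
    df.filter($"year" >= since).select("make").as[String].collect().sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; the master URL and app name are placeholders.
    val spark = SparkSession.builder().master("local[*]").appName("DatasetPreview").getOrCreate()
    val cars = Seq(Car("Tesla", 2016), Car("Ford", 2009), Car("Audi", 2018))

    println(recentMakesTyped(spark, cars, 2015))
    println(recentMakesUntyped(spark, cars, 2015))

    spark.stop()
  }
}
```

In the typed version, a misspelled field such as `_.yaer` fails at compile time. In the untyped version, a misspelled column name such as `$"yaer"` is only reported when Spark analyzes the query at run time.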

ISBN: 9781836201830