Spark快速大数据分析(第2版)

by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
November 2021
Intermediate to advanced
340 pages
10h 46m
Chinese
Posts & Telecom Press
Narrow and Wide Transformations
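As noted earlier, transformations are operations that Spark evaluates lazily. The biggest advantage of lazy evaluation is that Spark can inspect the entire computational query and then work out how to optimize its steps. The optimization includes chaining some operations together so that they are pipelined within a single stage, or splitting the operations into multiple stages depending on whether they require data to be exchanged or shuffled across cluster nodes.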
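The stage-splitting idea above can be sketched with a toy model in plain Python. This is not Spark code and not how Spark's scheduler is implemented: `split_into_stages` and the `(name, kind)` plan format are hypothetical names introduced only for illustration, and the sketch assumes the simplification that every wide operation starts a new stage.

```python
# Toy model only: these names are hypothetical and not part of the Spark API.
from typing import List, Tuple

# Each recorded operation is (name, kind), where kind is "narrow" or "wide".
Operation = Tuple[str, str]


def split_into_stages(plan: List[Operation]) -> List[List[str]]:
    """Group a recorded plan into stages, starting a new stage at each wide operation."""
    stages: List[List[str]] = []
    current: List[str] = []
    for name, kind in plan:
        if kind == "wide" and current:
            # A shuffle boundary: close the pipelined stage built so far.
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


# Recording the plan runs nothing; it is only a description of the work.
plan = [
    ("filter", "narrow"),
    ("select", "narrow"),
    ("groupBy", "wide"),
    ("orderBy", "wide"),
]
stages = split_into_stages(plan)
print(stages)
```

Because the whole plan is available before anything executes, the consecutive narrow operations can be grouped into one pipelined stage, while each wide operation marks a point where data would have to be exchanged.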
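Transformations fall into two classes, depending on whether their dependencies are narrow dependencies or wide dependencies. A transformation in which a single output partition is computed from a single input partition is called a narrow transformation. For example, in the previous code snippet, filter() and contains() are narrow transformations: they operate on each partition independently and produce the output partition without exchanging any data across partitions.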
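A minimal sketch of that per-partition independence, using plain Python lists to stand in for partitions. The function `narrow_filter` and the sample data are assumptions made for illustration; they are not Spark's API.

```python
# Toy model only: plain lists stand in for partitions; not the Spark API.
from typing import Callable, List

Partition = List[str]


def narrow_filter(
    partitions: List[Partition], predicate: Callable[[str], bool]
) -> List[Partition]:
    """Filter every partition on its own: output partition i reads only input partition i."""
    return [[row for row in part if predicate(row)] for part in partitions]


partitions = [
    ["Apache Spark", "Hadoop"],
    ["Spark SQL", "Flink", "Spark UI"],
    ["Kafka"],
]
filtered = narrow_filter(partitions, lambda line: "Spark" in line)
print(filtered)  # [['Apache Spark'], ['Spark SQL', 'Spark UI'], []]
```

The number of partitions is unchanged and no row moves between them, which is what makes the operation narrow in this model.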
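However, groupBy() and orderBy() give rise to wide transformations. A wide transformation has to read data from other partitions and combine it, and may have to write it to disk. Because each partition holds its own count of the lines that contain the word "Spark", computing the total count (or calling groupBy()) forces a shuffle of data from each executor's partitions across the cluster. In this transformation, orderBy() needs the output of other partitions to complete the final aggregation.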
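The shuffle described above can be illustrated with another toy model in plain Python. The helper names `target_partition` and `shuffle_and_count` are hypothetical, and the CRC32-based partitioner is an assumption chosen only because it is deterministic; Spark's own partitioning and shuffle machinery work differently.

```python
# Toy model only: these helpers are hypothetical and not part of the Spark API.
import zlib
from collections import Counter
from typing import Dict, List


def target_partition(key: str, num_partitions: int) -> int:
    """Pick the output partition for a key with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_and_count(
    partitions: List[List[str]], num_partitions: int
) -> List[Dict[str, int]]:
    """Count keys per input partition, then merge the counts by key across partitions."""
    # Step 1: each input partition computes its own local counts.
    local_counts = [Counter(part) for part in partitions]
    # Step 2 (the shuffle): every key's partial counts are sent to one
    # output partition, so an output partition reads from many inputs.
    shuffled: List[Counter] = [Counter() for _ in range(num_partitions)]
    for counts in local_counts:
        for key, n in counts.items():
            shuffled[target_partition(key, num_partitions)][key] += n
    return [dict(c) for c in shuffled]


partitions = [["spark", "hadoop", "spark"], ["flink", "spark"], ["hadoop"]]
result = shuffle_and_count(partitions, num_partitions=2)

# Merge the output partitions just to display the overall totals.
totals: Dict[str, int] = {}
for part in result:
    totals.update(part)
print(totals)
```

Unlike the per-partition filter, no single input partition knows the total for "spark"; the partial counts from several partitions must be brought together before the final value exists.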
Figure 2-7 illustrates these two types of transformations.
[Figure 2-7: Narrow transformations versus wide transformations]
2.5 Spark UI
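Spark includes a graphical user interface that you can use to inspect or monitor Spark applications at various levels of granularity: jobs, stages, and tasks. Depending on how Spark is deployed, the driver launches a web-based user interface (UI), which runs on port 4040 by default. In this interface you can view metrics and details such as the following: ...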
Publisher Resources

ISBN: 9787115576019