Skip to Content
Spark快速大数据分析(第2版)
book

Spark快速大数据分析(第2版)

by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
November 2021
Intermediate to advanced
340 pages
10h 46m
Chinese
Posts & Telecom Press
Content preview from Spark快速大数据分析(第2版)
结构化流处理
205
where()
)都独立处理每条输入记录,不需要前序行的任何信息。这种不依赖前序输入
数据的性质决定了这些操作都属于无状态操作。
只有无状态操作的流式查询支持追加输出模式和更新输出模式,但是不支持完整输出模
式。这是有道理的:因为这样的查询处理所得到的每行输出都不会因后来的数据而发生变
化,所以能够以追加模式写到各种输出池(包括只支持追加模式的输出池,如任意格式的
文件输出)。这样的查询天然地不会跨输入记录进行组合,因此结果中的数据量不会减少。
之所以无法支持完整模式,是因为存储不断增长的结果数据通常代价巨大。这与有状态转
化操作有着显著区别,我们接下来会讨论。
8.5.3
 有状态转化操作
有状态操作的最简示例就是
DataFrame.groupBy().count()
,这会生成自查询启动起收到的
记录条数的实时统计数据。在每个微型批中,增量执行计划会将新记录的计数结果加到前
一个微型批生成的计数结果上。这种在计划序列间传输的不完整计数结果就是状态。这个
状态维护在
Spark
执行器的内存中
,会在写检查点时写到配置的检查点位置,以实现容错。
虽然
Spark SQL
会自动管理这一状态的生命周期以确保结果正确
,但一般还是需要做一些
调整来控制维护状态所用的资源总量。本节将探索各种有状态算子如何在内部分别管理其
状态。
1.
分布式且容错的状态管理
1
章和第
2
章介绍过
,集群中运行的
Spark
应用有一个驱动器和一个以上的执行器。
Spark
调度器在驱动器内运行
,将用户的高层操作分解为多个小任务,并将任务放入队 ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

AirBnbBlueOriginElectronic ArtsHomeDepotNasdaqRakutenTata Consultancy Services

QuotationMarkO’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.
Julian F.
Head of Cybersecurity
QuotationMarkI wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.
Addison B.
Field Engineer
QuotationMarkI’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.
Amir M.
Data Platform Tech Lead
QuotationMarkI'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.
Mark W.
Embedded Software Engineer

You might also like

Go程序设计语言

Go程序设计语言

艾伦A. A.多诺万, 布莱恩W. 柯尼汉
数据压缩入门

数据压缩入门

Colt McAnlis, Aleks Haecky
解密金融数据

解密金融数据

Justin Pauley

Publisher Resources

ISBN: 9787115576019