Skip to Content
Spark:权威指南
book

Spark:权威指南

by Bill Chambers, Matei Zaharia
May 2025
Intermediate to advanced
606 pages
7h 38m
Chinese
O'Reilly Media, Inc.
Content preview from Spark:权威指南

第 14 章 分布式共享变量 分布式共享变量

本作品已使用人工智能进行翻译。欢迎您提供反馈和意见:translation-feedback@oreilly.com

在 除了弹性分布式数据集(RDD)接口之外,Spark 中的第二种底层 API 是两种类型的 "分布式共享变量":广播变量和累加器。这些变量可用于用户自定义函数(例如,在 RDD 或 DataFrame 上的map 函数中),在集群上运行时具有特殊属性。具体来说,累加器可以让你将所有任务的数据加在一起,形成一个共享结果(例如,实现一个计数器,这样你就能看到有多少作业的输入记录解析失败),而广播变量可以让你在所有工作节点上保存一个大值,并在许多 Spark 操作中重复使用,而无需将其重新发送到集群。本章将讨论这些变量类型的动机以及使用方法。

广播变量

广播 变量是一种无需在函数闭包中封装变量即可在集群中高效共享不可变值的方法。在任务内部的驱动节点中使用变量的常规方法是在函数闭包中简单地引用该变量(例如,在map 操作中),但这种方法可能效率不高,尤其是对于查找表或机器学习模型等大型变量。原因在于,在闭包中使用变量时,必须在工作节点上对其进行多次反序列化(每个任务一次)。此外,如果您在多个 Spark 操作和作业中使用同一个变量,那么它将随着每个作业而被重新发送给工作者,而不是一次。

这就是广播变量的用武之地。广播变量是共享的、不可变的变量,它们缓存在集群中的每台机器上,而不是序列化到每个任务中。典型的用例是在执行器的内存中传递一个大型查找表,并在函数中使用,如图 14-1 所示。

figure-1
图 14-1. 广播变量

例如,假设您有一个单词或数值列表:

// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"
  .split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)
# in Python
my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")
words = spark.sparkContext.parallelize(my_collection, 2)

您希望用其他信息来补充您的单词列表,这些信息的大小可能是千字节、兆字节,甚至可能是千兆字节。如果我们从 SQL 的角度考虑,这在技术上是一个右连接:

// in Scala
val supplementalData = Map("Spark" -> 1000, "Definitive" -> 200,
                           "Big" -> -300, "Simple" -> 100)
# in Python
supplementalData = {"Spark":1000, "Definitive":200,
                    "Big":-300, "Simple":100}

我们可以在 Spark 中广播这个结构,并通过使用suppBroadcast 来引用它。这个值是不可变的,当我们触发一个操作时,它会在集群中的所有节点上懒散地复制:

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

AirBnbBlueOriginElectronic ArtsHomeDepotNasdaqRakutenTata Consultancy Services

QuotationMarkO’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.
Julian F.
Head of Cybersecurity
QuotationMarkI wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.
Addison B.
Field Engineer
QuotationMarkI’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.
Amir M.
Data Platform Tech Lead
QuotationMarkI'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.
Mark W.
Embedded Software Engineer

You might also like

设计数据密集型应用程序

设计数据密集型应用程序

Martin Kleppmann
超越Vibe编程

超越Vibe编程

Addy Osmani
Kafka权威指南(第2版)

Kafka权威指南(第2版)

Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty

Publisher Resources

ISBN: 9798341656932