Chapter 3. Mapper Transformations
This chapter introduces the most common Spark mapper transformations through simple working examples. Without a clear understanding of transformations, it is hard to use them properly and meaningfully to solve data problems. We will examine mapper transformations in the context of the RDD data abstraction. A mapper is a function that processes all the elements of a source RDD and generates a target RDD. For example, a mapper can transform a String record into tuples, (key, value) pairs, or whatever your desired output may be. Informally, we can say that a mapper transforms a source RDD[V] into a target RDD[T], where V and T are the data types of the source and target RDDs, respectively. You may apply mapper transformations to DataFrames as well, either by applying DataFrame functions (using select() and UDFs) to all rows or by converting your DataFrame (a table of rows and columns) to an RDD and then using Spark's mapper transformations.
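To make the RDD[V] to RDD[T] idea concrete, here is a minimal PySpark sketch; the sample records and the "name,amount" record format are illustrative assumptions, not data from the book. It maps a source RDD[String] into a target RDD of (key, value) pairs.

# Minimal sketch: map RDD[String] -> RDD[(String, int)].
# The sample records and their "name,amount" layout are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapper-demo").getOrCreate()
sc = spark.sparkContext

# Source RDD[String]: each record is "name,amount"
records = sc.parallelize(["alice,10", "bob,4", "alice,7"])

# Mapper function: parse one String record into a (key, value) pair
def to_pair(rec):
    name, amount = rec.split(",")
    return (name, int(amount))

pairs = records.map(to_pair)
print(pairs.collect())  # [('alice', 10), ('bob', 4), ('alice', 7)]

Here map() is the mapper transformation: it is applied to every element of the source RDD and produces exactly one output element per input element.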
Data Abstractions and Mappers
Spark has many transformations and actions, but this chapter focuses on the ones most often used in building Spark applications. Spark's simple and powerful mapper transformations let us express ETL operations in a straightforward way.
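The previous section mentioned two ways to apply mapper-style logic to a DataFrame: a UDF applied to all rows through select(), or converting the DataFrame to an RDD first. The sketch below illustrates both; the column names, sample rows, and the uppercase transformation are hypothetical choices for illustration, not the book's examples.

# Minimal sketch of the two DataFrame options described above.
# Column names and sample data are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-mapper-demo").getOrCreate()

df = spark.createDataFrame([("alice", 10), ("bob", 4)], ["name", "amount"])

# Option 1: apply a UDF to every row via select()
upper_udf = udf(lambda s: s.upper(), StringType())
df2 = df.select(upper_udf(col("name")).alias("name_upper"), col("amount"))
df2.show()

# Option 2: convert the DataFrame to an RDD of Rows, then use an RDD mapper
pairs = df.rdd.map(lambda row: (row["name"].upper(), row["amount"]))
print(pairs.collect())  # [('ALICE', 10), ('BOB', 4)]

In practice, the select()/UDF route keeps you in the DataFrame API (and its optimizer), while dropping down to the RDD gives you the full flexibility of arbitrary Python mapper functions.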
As I’ve mentioned, the RDD is an important data abstraction in Spark that is suitable for unstructured and semi-structured data.