Chapter 3. Mapper Transformations

This chapter introduces the most common Spark mapper transformations through simple working examples. Without a clear understanding of transformations, it is hard to apply them properly to real data problems. We will examine mapper transformations in the context of the RDD data abstraction. A mapper is a function that processes all the elements of a source RDD and generates a target RDD. For example, a mapper can transform a String record into a tuple, a (key, value) pair, or whatever your desired output may be. Informally, we can say that a mapper transforms a source RDD[V] into a target RDD[T], where V and T are the data types of the source and target RDDs, respectively. You may apply mapper transformations to DataFrames as well, by either applying DataFrame functions (using select() and UDFs) to all rows or converting your DataFrame (a table of rows and columns) to an RDD and then using Spark's mapper transformations.

Data Abstractions and Mappers

Spark has many transformations and actions, but this chapter is dedicated to explaining the ones most often used in building Spark applications. Spark's simple and powerful mapper transformations let us express ETL operations concisely.

As I’ve mentioned, the RDD is an important data abstraction in Spark that is suitable for unstructured and semi-structured ...
