Data Algorithms with Spark

by Mahmoud Parsian
April 2022
Content level: Intermediate to advanced
435 pages
9h 44m
English
O'Reilly Media, Inc.

Chapter 2. Transformations in Action

In this chapter, we will explore the most important Spark transformations (mappers and reducers) in the context of data summarization design patterns, and examine how to select specific transformations for targeted problems.

As you will see, a given problem (we’ll use the DNA base count problem here) can have several PySpark solutions built on different Spark transformations. These solutions differ in efficiency because of how each transformation is implemented and how it shuffles data (the shuffle is the step in which values are grouped by key). The DNA base count problem closely resembles the classic word count problem (finding the frequency of unique words in a set of files or documents), except that in DNA base counting you find the frequencies of the DNA letters (A, T, C, G).
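To make the shuffle point concrete, here is a minimal plain-Python sketch (not Spark code, and not taken from the book) that simulates two ways of counting DNA bases over partitioned data. The partition contents and the function names `emit_pairs`, `count_group_style`, and `count_reduce_style` are hypothetical, chosen only for illustration. The two styles loosely mirror the commonly described difference between Spark's `groupByKey()` and `reduceByKey()`; this excerpt does not state which transformations the chapter's own solutions use.

```python
from collections import Counter, defaultdict

# Hypothetical input: each inner list stands in for one partition of DNA strings.
partitions = [
    ["ATCG", "AATT"],
    ["GGCC", "ATAT"],
]

def emit_pairs(sequence):
    """Mapper: emit one (base, 1) pair per DNA letter."""
    return [(base, 1) for base in sequence.upper() if base in "ATCG"]

def count_group_style(parts):
    """Send every (base, 1) pair through the simulated shuffle, then sum per key."""
    shuffled = [pair for part in parts for seq in part for pair in emit_pairs(seq)]
    grouped = defaultdict(list)
    for base, one in shuffled:
        grouped[base].append(one)
    totals = {base: sum(ones) for base, ones in grouped.items()}
    return totals, len(shuffled)

def count_reduce_style(parts):
    """Sum within each partition first, then shuffle only the partial sums."""
    shuffled = []
    for part in parts:
        local = Counter()
        for seq in part:
            for base, one in emit_pairs(seq):
                local[base] += one
        shuffled.extend(local.items())  # at most 4 records per partition
    totals = Counter()
    for base, partial in shuffled:
        totals[base] += partial
    return dict(totals), len(shuffled)

group_counts, group_shuffled = count_group_style(partitions)
reduce_counts, reduce_shuffled = count_reduce_style(partitions)
```

Both functions produce the same base frequencies, but the second moves far fewer records between partitions, because each partition contributes at most one partial sum per base regardless of how long its sequences are.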

I chose this problem because solving it teaches us about data summarization: condensing a large quantity of information (here, DNA data strings/sequences) into a much smaller set of useful information (the frequency of DNA letters).

This chapter provides three complete end-to-end solutions in PySpark, using different mappers and reductions to solve the DNA base count problem. We’ll discuss the performance differences between them, and explore data summarization design patterns.

The DNA Base Count Example

The purpose of our example in this chapter is to count DNA bases in ...

ISBN: 9781492082378