Chapter 1. Introduction to High Performance Spark

This chapter provides an overview of what we hope you will be able to learn from this book and does its best to convince you to learn to read some Scala and consider writing your Spark jobs in Scala or Python.

Feel free to skip ahead to Chapter 2 if you already know what you’re looking for.

What Is Spark and Why Performance Matters

Spark is a high-performance, general-purpose distributed computing system that has become the most active open source project of the Apache Software Foundation, with more than 1,000 active contributors.1 (The acronym ASF currently stands for the Apache Software Foundation, although there are calls to rename the foundation.)

Spark enables us to process large quantities of data, beyond what can fit on a single machine, with a high-level, relatively easy-to-use API. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Critically, Spark allows us to write the logic of data transformations and machine learning algorithms in a parallelizable way, while remaining relatively system-agnostic.2
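As a small taste of that API, here is a minimal word-count sketch in Scala (the input path is a hypothetical placeholder); the same few lines run unchanged whether the file fits on a laptop or spans a cluster:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    import spark.implicits._

    // Read a text file, split it into words, and count each word.
    // The path below is a hypothetical placeholder.
    val words = spark.read.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
    val counts = words.groupByKey(identity).count()
    counts.show()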

However, despite its many advantages and the excitement around Spark, the simplest implementation of many common data science routines in Spark can be much slower and less robust than the best version. Since the computations we are concerned with may involve data at a very large scale, the time and resource gains from tuning code for performance are enormous. Performance does not just mean running faster; often, at this scale, it means getting something to run at all. It is possible to construct a Spark query that fails on gigabytes of data but, when refactored and adjusted with an eye toward the structure of the data and the requirements of the cluster, succeeds on the same system with terabytes of data. In our experience writing production Spark code, we have seen the same tasks run 100 times faster on the same clusters after applying some of the optimizations discussed in this book. In terms of data processing, time is money, and we hope this book pays for itself through a reduction in data infrastructure costs and developer hours.

Not all of these techniques apply to every use case. Especially because Spark is highly configurable and exposes its functionality at a higher level of abstraction than computational frameworks of comparable power, we can reap tremendous benefits just by becoming more attuned to the shape and structure of our data. Some techniques work well only at certain data sizes or even with certain key distributions. As a simple example, using groupByKey in Spark can very easily cause the dreaded out-of-memory exceptions, but for data with few duplicate keys this operation can be just as quick as the alternatives that we will present. Learning to understand your particular use case and system, and how Spark will interact with it, is a must to solve the most complex data science problems with Spark.
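For instance, here is a sketch of two approaches to a per-key sum (assuming sc is an existing SparkContext):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))

    // Risky on skewed data: groupByKey materializes every value for a
    // key on a single executor before we sum them.
    val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // Generally safer: reduceByKey computes partial sums on the map
    // side, so far less data crosses the network.
    val sumsViaReduce = pairs.reduceByKey(_ + _)

On tiny or nearly duplicate-free data the two perform similarly; the difference shows up as individual keys accumulate many values.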

What You Can Expect to Get from This Book

Our hope is that this book will help you take your Spark queries and make them run faster, handle larger data sizes, and use fewer resources. This book covers a broad range of tools and scenarios. You will likely pick up some techniques that do not apply to the problems you are working with today but that might apply to a problem in the future, and that may help shape your understanding of Spark more generally. Most of the chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional, and reading the sections in order should give you not just a few scattered tips but a comprehensive understanding of Spark and how to make it sing.

It’s equally important to point out what you will likely not get from this book. This book is not intended to introduce Spark, Scala, or Python; several other books and video series are available to get you started. The authors may be a little biased in this regard,3 but we think Learning Spark, 2nd Edition, by Jules Damji, Brooke Wenig, Tathagata Das, and Denny Lee is an excellent option for Spark beginners. While this book is focused on performance, it is not an operations book, so topics like setting up a cluster and multitenancy are not covered. We assume you already have a way to use Spark in your system, so we won’t provide much assistance in making higher-level architecture decisions.

Spark Versions

Spark attempts to follow semantic versioning with the standard [MAJOR].[MINOR].[MAINTENANCE] scheme, promising API stability for public, non-experimental, non-developer APIs within minor and maintenance releases. Many of the experimental components are among the more exciting ones from a performance standpoint, including custom aggregations and expressions to accelerate the evaluation of Spark’s Datasets.
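As an illustration of one such component, here is a sketch of a custom typed aggregation built on the Aggregator API; the sum-of-squares logic is our own invented example:

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}

    // Invented example: a typed aggregation computing a sum of squares.
    object SumOfSquares extends Aggregator[Double, Double, Double] {
      def zero: Double = 0.0                                    // empty buffer
      def reduce(buf: Double, x: Double): Double = buf + x * x  // fold in one value
      def merge(b1: Double, b2: Double): Double = b1 + b2       // combine partitions
      def finish(buf: Double): Double = buf                     // final result
      def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

Since Spark 3.0, such an aggregator can be registered as a user-defined aggregate function via org.apache.spark.sql.functions.udaf.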

Spark aims for binary API compatibility between releases, checked using MiMa,4 so if you are using the stable API, you theoretically should not need to recompile your job to run it against a new version of Spark unless the major version has changed. In practice, we recommend recompiling all Spark jobs against the latest minor version, as mistakes in binary compatibility have been known to happen.
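With sbt, that recompilation usually amounts to bumping a pinned version and rebuilding; a minimal build.sbt sketch (the version numbers are illustrative):

    // build.sbt -- version numbers are illustrative
    scalaVersion := "2.12.15"
    libraryDependencies +=
      "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"

The "provided" scope keeps Spark itself out of your application jar, since the cluster supplies its own copy at runtime.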

Tip

This book was created using the Spark 3.3.0 APIs, but much of the code will work in earlier versions of Spark as well. In places where this is not the case, we have attempted to call that out.

Why the Focus on Scala and Python?

In this book, we will focus on Spark’s Scala and Python APIs, with call-outs to other languages where relevant. Part of this decision is simply in the interest of time and space; we trust that readers wanting to use Spark in another language can read at least one of Scala and Python.

Note

In the first edition, we said that we believed most users seeking high-performance Spark would have to use some Scala. However, improvements in Spark’s Python integration mean this assertion is no longer correct.

While we don’t expect you to know how to write Scala or Python code, we do expect that you’ll be able to read Scala or Python.

To Be a Spark Expert You Have to Be Able to Read a Little Scala Anyway

Although Python and Java are more commonly used languages, learning to read Scala is a worthwhile investment for anyone interested in delving deep into Spark development. Spark’s documentation can be uneven, but the readability of the codebase is world-class, and, perhaps more than with other frameworks, cultivating a sophisticated understanding of that codebase is integral to becoming an advanced Spark user. Because Spark is written in Scala, it is difficult to interact with the Spark source code without at least the ability to read Scala. The methods in the Resilient Distributed Datasets (RDD) class closely mimic those in the Scala collections API: RDD functions such as map, filter, flatMap, reduce, and fold have nearly identical specifications to their Scala equivalents.5 Even folks who primarily use Spark’s higher-level APIs will appreciate being able to understand the underlying RDDs.
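To see how closely the two APIs mirror each other, compare the same pipeline on a local Scala collection and on an RDD (assuming sc is an existing SparkContext):

    val nums = List(1, 2, 3, 4, 5)

    // Scala collections: runs on a single machine.
    val localResult = nums.map(_ * 2).filter(_ > 4).reduce(_ + _)  // 24

    // RDD: the same shape, but evaluated lazily and in parallel
    // across the cluster.
    val distResult =
      sc.parallelize(nums).map(_ * 2).filter(_ > 4).reduce(_ + _)  // 24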

Fundamentally, Spark is a functional framework, relying heavily on concepts like immutability and lambda definitions, so using the Spark API may be more intuitive with some knowledge of functional programming. Programmers already familiar with functional idioms in Python or Scala (map, filter, etc.) will have the easiest time.

The Spark Scala and Python APIs Are Easier to Use Than the Java API

Once you have learned Scala or Python, you will quickly find that writing Spark in Scala or Python is less painful than writing Spark in Java. The Spark shell can be a powerful tool for debugging and development, and is only available in languages with existing REPLs (Scala, Python, and R).
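A minimal sketch of such a session in the Scala shell (sc, the SparkContext, is created for you):

    $ spark-shell
    scala> val rdd = sc.parallelize(1 to 100)
    scala> rdd.filter(_ % 2 == 0).count()
    res1: Long = 50

Being able to evaluate a transformation interactively and inspect the result makes iterating much faster than a full compile-and-submit cycle.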

Why Not Scala?

There are several good reasons to develop with Spark in other languages. One of the more important is developer and team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the best-supported languages today, with some of the best machine-learning tools available.

While writing Java code can be clunky and sometimes lags slightly in terms of API, there is very little performance cost to writing in another JVM language (at most, some object conversions). Spark’s Java APIs also tend to be longer lived than its Scala APIs, simply because Scala has historically not provided binary compatibility between versions.6

Tip

While all of the examples in this book are presented in Scala for the final release, we will port many of the examples from Scala to Java and Python where the differences in implementation could be important. These will be available (over time) in our GitHub repository. If you find yourself wanting a specific example ported, please either email us or create an issue on the GitHub repo.

Spark SQL is not only simpler to understand and write, it also applies performance optimizations that you would otherwise have to implement by hand with RDDs. In some ways, Spark SQL is like the “autopilot” of Spark: you should still keep an eye on it, but it does a pretty good job. Since queries are compiled to generated code behind the scenes, the ability to read Scala and Java can be useful; this code generation also does much to minimize the performance difference when using a non-JVM language. Chapter 8 looks at options to work effectively in Spark with languages outside of the JVM, including Spark’s supported languages of Python and R. That chapter also offers guidance on how to use Fortran, C, and GPU-specific code to reap additional performance improvements. Even if we develop most of our Spark applications in Scala, we shouldn’t feel tied to doing everything in Scala, because specialized libraries in other languages can be well worth the overhead of going outside the JVM.
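As a sketch of the contrast (df and rdd are assumed to already exist, and the column names are hypothetical), here is an average by key expressed against the optimizer versus by hand on RDDs:

    // DataFrame version: the Catalyst optimizer plans the aggregation,
    // including map-side partial aggregation, for us.
    val avgByDept = df.groupBy("dept").avg("salary")

    // RDD version: we must implement the partial aggregation ourselves
    // to avoid shipping every row across the network.
    val avgByDeptRdd = rdd                       // RDD[(String, Double)]
      .mapValues(s => (s, 1L))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }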

Learning Scala

If, after all of this, we’ve convinced you to use Scala, there are several excellent options for learning it. Note that Spark 3.3 is built against Scala 2.12 and cross-compiled against Scala 2.13, so your own builds should use a matching Scala version.

Depending on how much we’ve convinced you to learn Scala, and what your resources are, there are several different options ranging from books to massive open online courses (MOOCs) to professional training.

For books, Programming Scala, 3rd Edition, by Dean Wampler can be great, although much of the actor system references are not relevant while working in Spark. The Scala language website also maintains a list of Scala books.

In addition to books, there are online courses for learning Scala. Functional Programming Principles in Scala, taught by Scala’s creator, Martin Odersky, is available on Coursera, and Introduction to Functional Programming is available on edX. A number of companies also offer video-based Scala courses, none of which the authors have personally experienced.

For those who prefer a more interactive approach, professional training is offered by several companies. Be sure to verify the company’s approach to software licensing early in your exploration.

Conclusion

Although you will likely be able to get the most out of Spark performance if you have some understanding of Scala, working in Spark does not require knowledge of Scala. Special techniques for working with other languages are covered in Chapter 8. This book is aimed at individuals who already have a grasp of the basics of Spark, and we thank you for choosing High Performance Spark to deepen your knowledge of Spark.

The next chapter will introduce some of Spark’s general design and evaluation paradigms that are important to understanding how to efficiently utilize Spark.

1 http://spark.apache.org/

2 Spark runs on the majority of common OSs and architectures as well as many types of cluster managers (YARN, Kubernetes, Mesos, etc.). It even runs on less common systems, like z/OS.

3 Holden co-wrote the 1st edition of Learning Spark, and while she receives no royalties on it, she thinks the second edition is great.

4 MiMa is the Migration Manager for Scala and tries to catch binary incompatibilities between releases.

5 Although, as we explore in this book, the performance implications and evaluation semantics are quite different.

6 Despite the implications of https://docs.scala-lang.org/overviews/core/binary-compatibility-of-scala-releases.html, releases rarely achieve these goals in meaningful ways; see https://mungingdata.com/scala/maintenance-nightmare-upgrade/ (plus the empirical evidence of cross-building Scala versions on Maven Central).
