Chapter 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)

This chapter will cover some of the more nuanced language specifics of Apache Spark. We’ve seen a huge number of PySpark examples throughout the book. In Chapter 1, we discussed at a high level how Spark runs code from other languages. Let’s talk through some of the more specific integrations:

  • PySpark

  • SparkR

  • sparklyr

As a reminder, Figure 32-1 shows the fundamental architecture for these specific languages.

Figure 32-1. The Spark Driver

Now let’s cover each of these in depth.


We covered a ton of PySpark throughout this book. In fact, PySpark is included alongside Scala and SQL in nearly every chapter in this book. Therefore, this section will be short and sweet, covering only the details that are relevant to Spark itself. As we discussed in Chapter 1, Spark 2.2 included a way to install PySpark with pip. Simply, pip install pyspark will make it available as a package on your local machine. This is new, so there may be some bugs to fix, but it is something that you can leverage in your projects today.

Fundamental PySpark Differences

If you’re using the structured APIs, your code should run just about as fast as if you had written it in Scala, except if you’re not using UDFs in Python. If you’re using a UDF, you may have a performance impact. Refer back to Chapter 6 for more information ...

Get Spark: The Definitive Guide now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.