Chapter 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
This chapter covers some of the more nuanced language specifics of Apache Spark. We’ve seen a huge number of PySpark examples throughout the book, and in Chapter 1 we discussed at a high level how Spark runs code from other languages. Let’s now talk through some of the more specific integrations.
As a reminder, Figure 32-1 shows the fundamental architecture for these specific languages.
Now let’s cover each of these in depth.
Python (PySpark)
We covered a ton of PySpark throughout this book. In fact, PySpark is included alongside Scala and SQL in nearly every chapter, so this section will be short and sweet, covering only the details that are relevant to Spark itself. As we discussed in Chapter 1, Spark 2.2 added the ability to install PySpark with pip: running pip install pyspark makes it available as a package on your local machine. This capability is relatively new, so there may be some bugs to fix, but it is something that you can leverage in your projects today.
Fundamental PySpark Differences
If you’re using the structured APIs, your code should run just about as fast as if you had written it in Scala, provided you don’t use UDFs written in Python. If you do use a Python UDF, you may see a performance impact. Refer back to Chapter 6 for more information on why this is the case.