Chapter 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
This chapter will cover some of the more nuanced language specifics of Apache Spark. We’ve seen a huge number of PySpark examples throughout the book. In Chapter 1, we discussed at a high level how Spark runs code from other languages. Let’s talk through some of the more specific integrations:
-
PySpark
-
SparkR
-
sparklyr
As a reminder, Figure 32-1 shows the fundamental architecture for these specific languages.
Figure 32-1. The Spark Driver
Now let’s cover each of these in depth.
PySpark
We covered a ton of PySpark throughout this book. In fact, PySpark is included alongside Scala and SQL in nearly every chapter in this book. Therefore, this section will be short and sweet, covering only the details that are relevant to Spark itself. As we discussed in Chapter 1, Spark 2.2 included a way to install PySpark with pip. Simply, pip install pyspark will make it available as a package on your local machine. This is new, so there may be some bugs to fix, but it is something that you can leverage in your projects today.
Fundamental PySpark Differences
If you’re using the structured APIs, your code should run just about as fast as if you had written it in Scala, except if you’re not using UDFs in Python. If you’re using a UDF, you may have a performance impact. Refer back to Chapter 6 for more information ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access