January 2020
In chapter 7, we learned about Hadoop and Spark, two frameworks for distributed computing. In chapter 8, we got into the weeds of Hadoop, taking a close look at how we might use it to parallelize our Python work on large datasets. In this chapter, we'll become familiar with PySpark: the Python interface to Spark, a Scala-based, in-memory framework for processing large datasets.
As mentioned in chapter 7, Spark has some advantages: