Chapter 9. PageRank with map and reduce in PySpark

This chapter covers

  • Options for parallel map and reduce routines in PySpark
  • Convenience methods of PySpark’s RDD class for common operations
  • Implementing the historic PageRank algorithm in PySpark
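To preview the algorithm we'll build toward, here is a minimal pure-Python sketch of PageRank's core iteration (the chapter itself implements it in PySpark). The function name, the example graph, and the damping factor of 0.85 are illustrative assumptions, not code from the book:

```python
# Illustrative sketch of PageRank iteration in plain Python.
# The damping factor of 0.85 is a conventional choice, assumed here.
DAMPING = 0.85

def pagerank(links, iterations=20):
    """links: dict mapping each node to its list of outbound neighbors."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node shares its current rank equally among its outbound links.
        contribs = {node: 0.0 for node in nodes}
        for node, neighbors in links.items():
            share = ranks[node] / len(neighbors)
            for neighbor in neighbors:
                contribs[neighbor] += share
        # Blend the collected contributions with the damping factor.
        ranks = {node: (1 - DAMPING) / n + DAMPING * contribs[node]
                 for node in nodes}
    return ranks

# A tiny three-page example graph (hypothetical).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The "share rank, then sum contributions" step is exactly the shape that maps onto PySpark's map and reduce operations later in the chapter.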

In chapter 7, we learned about Hadoop and Spark, two frameworks for distributed computing. In chapter 8, we dove into the weeds of Hadoop, taking a close look at how we might use it to parallelize our Python work for large datasets. In this chapter, we'll become familiar with PySpark, the Python interface to Spark: a Scala-based, in-memory framework for processing large datasets.

As mentioned in chapter 7, Spark has some advantages:

  • Spark can be very, very fast.
  • Spark programs use all the same map and reduce techniques we learned in the previous chapters.
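As a reminder of that second point, the map-and-reduce pattern can be sketched in plain Python; in PySpark the same shape appears as the `map` and `reduce` methods of an RDD. The numbers and lambdas below are illustrative assumptions:

```python
from functools import reduce

# The map-and-reduce pattern in plain Python.
# In PySpark, the equivalent would be rdd.map(...) and rdd.reduce(...).
numbers = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, numbers))  # map step: transform each item
total = reduce(lambda a, b: a + b, squares)    # reduce step: combine to one value
```

The difference in Spark is that the map step runs in parallel across a cluster, and the reduce step combines partial results from each worker.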
