Chapter 9. PageRank with map and reduce in PySpark

This chapter covers

  • Options for parallel map and reduce routines in PySpark
  • Convenience methods of PySpark’s RDD class for common operations
  • Implementing the historic PageRank algorithm in PySpark

In chapter 7, we learned about Hadoop and Spark, two frameworks for distributed computing. In chapter 8, we dove into the weeds of Hadoop, taking a close look at how we might use it to parallelize our Python work for large datasets. In this chapter, we'll become familiar with PySpark, the Python interface to Spark: a Scala-based, in-memory framework for processing large datasets.

As mentioned in chapter 7, Spark has some advantages:

  • Spark can be very, very fast.
  • Spark programs use all the same map and reduce techniques we learned ...
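That continuity is the key point: the same map-then-reduce pipeline from earlier chapters carries over almost verbatim. Here is a minimal sketch using Python's built-ins, with the equivalent PySpark RDD calls shown in comments (the commented lines assume a running `SparkContext`, which we set up later in the chapter):

```python
from functools import reduce

# Square each number, then sum the squares: the classic
# map-then-reduce pattern, using Python's built-ins.
numbers = range(1, 6)
squares = map(lambda n: n * n, numbers)
total = reduce(lambda a, b: a + b, squares)
print(total)  # 1 + 4 + 9 + 16 + 25 = 55

# The same pipeline expressed on a PySpark RDD looks nearly identical:
#   rdd = spark_context.parallelize(range(1, 6))
#   total = rdd.map(lambda n: n * n).reduce(lambda a, b: a + b)
```

The difference is where the work runs: the built-in version executes in a single Python process, while the RDD version distributes the map and reduce steps across a Spark cluster.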
