Chapter 9. PageRank with map and reduce in PySpark

This chapter covers

  • Options for parallel map and reduce routines in PySpark
  • Convenience methods of PySpark’s RDD class for common operations
  • Implementing the historic PageRank algorithm in PySpark

In chapter 7, we learned about Hadoop and Spark, two frameworks for distributed computing. In chapter 8, we dove into the weeds of Hadoop, taking a close look at how we might use it to parallelize our Python work for large datasets. In this chapter, we'll become familiar with PySpark, the Python interface to Spark: a Scala-based, in-memory framework for processing large datasets.

As mentioned in chapter 7, Spark has some advantages:

  • Spark can be very, very fast.
  • Spark programs use all the same map and reduce techniques we learned ...
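That continuity is the key point: the same map-then-reduce pipeline from earlier chapters carries over almost verbatim. Here is a minimal sketch using Python's built-ins, with the equivalent PySpark RDD calls shown in comments (the commented lines assume a running `SparkContext`, which we set up later in the chapter):

```python
from functools import reduce

# Square each number, then sum the squares: the classic
# map-then-reduce pattern, using Python's built-ins.
numbers = range(1, 6)
squares = map(lambda n: n * n, numbers)
total = reduce(lambda a, b: a + b, squares)
print(total)  # 1 + 4 + 9 + 16 + 25 = 55

# The same pipeline expressed on a PySpark RDD looks nearly identical:
#   rdd = spark_context.parallelize(range(1, 6))
#   total = rdd.map(lambda n: n * n).reduce(lambda a, b: a + b)
```

The difference is where the work runs: the built-in version executes in a single Python process, while the RDD version distributes the map and reduce steps across a Spark cluster.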
