Chapter 8: Scaling out Python Using Clusters

In the previous chapter, we discussed parallel processing on a single machine using threads and processes. In this chapter, we extend that discussion from a single machine to multiple machines working together in a cluster: a group of computing devices that cooperate to perform compute-intensive tasks such as data processing. In particular, we will study Python's capabilities for data-intensive computing, which typically relies on clusters to process large volumes of data in parallel. Although quite a few frameworks and tools are available for this purpose, we will focus on Apache Spark as the data processing engine and PySpark ...
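
To give a first flavour of what this looks like in practice, the following is a minimal PySpark sketch, not one of the chapter's own examples. It assumes PySpark is installed (for example via pip install pyspark) and uses local mode, in which Spark simulates a cluster on a single machine; the application name and the data are purely illustrative.

    # Minimal PySpark sketch: distribute a small dataset and process it in parallel.
    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession. "local[*]" runs Spark on all local cores;
    # on a real cluster, the master URL would point to the cluster manager instead.
    spark = (SparkSession.builder
             .appName("SumOfSquaresSketch")   # illustrative application name
             .master("local[*]")
             .getOrCreate())

    # Distribute a collection across the (simulated) cluster and compute in parallel.
    numbers = spark.sparkContext.parallelize(range(1, 1001))
    total = numbers.map(lambda x: x * x).sum()
    print(f"Sum of squares: {total}")

    spark.stop()

On a real cluster, the same code runs unchanged except for the master URL, which is one of the reasons Spark is attractive for scaling out Python workloads.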
