Chapter 8. Distributed Environments – Hadoop and Spark
In this chapter, we will introduce a new way to process data: scaling horizontally. So far, we've focused our attention primarily on processing big data on a standalone machine; here, we will introduce some methods that run on a cluster of machines.
Specifically, we will first illustrate the motivations and circumstances under which we need a cluster to process big data. Then, we will introduce the Hadoop framework and its components (HDFS, MapReduce, and YARN) with a few examples, and finally, we will introduce the Spark framework and its Python interface, PySpark.
From a standalone machine to a bunch of nodes
The amount of data stored in the world is increasing exponentially. Nowadays, for a data ...