Chapter 2. Getting Started with Apache Hadoop and Apache Spark
In this chapter, we will cover the basics of Hadoop and Spark, look at how Spark differs from MapReduce, and get started with installing clusters and setting up the tools needed for analytics.
This chapter is divided into the following subtopics:
- Introducing Apache Hadoop
- Introducing Apache Spark
- Discussing why we use Hadoop with Spark
- Installing Hadoop and Spark clusters
Introducing Apache Hadoop
Apache Hadoop is a software framework that enables distributed processing of petabytes of data on large clusters with thousands of nodes. Apache Hadoop clusters can be built using commodity hardware, where failure rates are generally high. Hadoop is designed to handle these failures gracefully ...