Chapter 1. Introduction to Apache Spark: A Unified Analytics Engine
This chapter lays out the origins of Apache Spark and its underlying philosophy. It also surveys the main components of the project and its distributed architecture. If you are familiar with Spark’s history and the high-level concepts, you can skip this chapter.
The Genesis of Spark
In this section, we’ll chart the course of Apache Spark’s short evolution: its genesis, its inspiration, and its adoption in the community as the de facto unified processing engine for big data.
Big Data and Distributed Computing at Google
When we think of scale, we can’t help but think of the ability of Google’s search engine to index and search the world’s data on the internet at lightning speed. The name Google is synonymous with scale. In fact, Google is a deliberate misspelling of the mathematical term googol: that’s 1 followed by 100 zeros!
Neither traditional storage systems such as relational database management systems (RDBMSs) nor imperative ways of programming were able to handle the scale at which Google wanted to build and search the internet’s indexed documents. The resulting need for new approaches led to the creation of the Google File System (GFS), MapReduce (MR), and Bigtable.
While GFS provided a fault-tolerant and distributed filesystem across many commodity hardware servers in a cluster farm, Bigtable offered scalable storage of structured data across GFS. MR introduced a new parallel programming paradigm, based on functional programming, ...