Spark architecture

Apache Spark is designed to simplify the laborious, and sometimes error-prone, task of highly parallelized, distributed computing. To understand how it does this, let's explore its history and identify what Spark brings to the table.

History of Spark

Apache Spark implements a form of data parallelism that seeks to improve upon the MapReduce paradigm popularized by Apache Hadoop. It extends MapReduce in four key areas:

  • Improved programming model: Spark provides a higher level of abstraction through its APIs than Hadoop does, creating a programming model that significantly reduces the amount of code that must be written. By introducing a fluent, side-effect-free, function-oriented API, Spark makes it possible to reason about an analytic ...
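To make the "fluent, function-oriented" point concrete, here is a rough sketch in plain Python (not actual Spark code) of the flatMap → map → reduceByKey pipeline that a Spark word count expresses in one chained call; the sample input lines are illustrative, and each stage is side-effect-free, mirroring the style Spark's API encourages:

```python
from itertools import groupby

lines = ["spark extends mapreduce", "spark is fast"]

# flatMap: split each line into words
words = (w for line in lines for w in line.split())

# map: emit (word, 1) pairs, then sort by key to group them
# (standing in for Spark's shuffle)
pairs = sorted((w, 1) for w in words)

# reduceByKey: sum the counts for each word
counts = {key: sum(n for _, n in group)
          for key, group in groupby(pairs, key=lambda p: p[0])}

print(counts)  # e.g. {'extends': 1, 'fast': 1, 'is': 1, 'mapreduce': 1, 'spark': 2}
```

In Spark the same logic is a single fluent chain over a distributed dataset, whereas the equivalent Hadoop MapReduce job requires separate mapper and reducer classes plus driver boilerplate.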
