Spark architecture

Apache Spark is designed to simplify the laborious, and sometimes error-prone, task of highly parallelized, distributed computing. To understand how it does this, let's explore its history and identify what Spark brings to the table.

History of Spark

Apache Spark implements a type of data parallelism that seeks to improve upon the MapReduce paradigm popularized by Apache Hadoop. It extends MapReduce in four key areas:

  • Improved programming model: Spark provides a higher level of abstraction through its APIs than Hadoop does, creating a programming model that significantly reduces the amount of code that must be written. By introducing a fluent, side-effect-free, function-oriented API (see the sketch below), Spark makes it possible to reason about an analytic ...
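To make the fluent-API point concrete, here is a minimal sketch, not taken from the book, of a word count expressed as a chain of side-effect-free transformations on an RDD in Scala. The input path is a placeholder, and local mode is assumed purely for illustration; an equivalent job in classic Hadoop MapReduce typically requires separate mapper and reducer classes plus driver boilerplate.

import org.apache.spark.sql.SparkSession

object FluentExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fluent-example")
      .master("local[*]") // local mode, assumed for illustration only
      .getOrCreate()

    // Word count as a fluent chain of pure transformations.
    // Nothing executes until an action (take, below) is invoked.
    val counts = spark.sparkContext
      .textFile("input.txt") // hypothetical placeholder path
      .flatMap(_.split("\\s+")) // split lines into words
      .map(word => (word, 1)) // pair each word with a count of 1
      .reduceByKey(_ + _) // sum counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}

Each step in the chain is lazy and free of side effects, so the whole pipeline can be read, and reasoned about, as a single expression; the action at the end triggers the distributed execution.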
