Resilient distributed datasets

The heart of Spark is the resilient distributed dataset (RDD). Spark has four design goals: store data in memory (unlike Hadoop, which is not in-memory), distribute it across a cluster, be fault tolerant, and be fast and efficient.

Fault tolerance is achieved, in part, by applying linear operations to small chunks of data. Efficiency is achieved by parallelizing operations across all parts of the cluster. Performance is achieved by minimizing data replication between cluster members.

A fundamental concept in Spark is that only two types of operations can be performed on an RDD:

  • Transformations: A new RDD is created from the original; for example, mapping, filtering, union, intersection, sort, join, coalesce
  • Actions: The original RDD ...
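The split between the two kinds of operations can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's API or implementation: the class ToyRDD and the helper square are hypothetical names invented for this example. It assumes the usual Spark behavior in which transformations such as map, filter, and union only describe a new dataset, and actions such as collect and count trigger the computation and return an ordinary value.

```python
import itertools
from typing import Callable, Iterable


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API or implementation)."""

    def __init__(self, source: Callable[[], Iterable]):
        # Keep a recipe that produces the data, not the computed data itself.
        self._source = source

    @classmethod
    def parallelize(cls, items: Iterable) -> "ToyRDD":
        data = list(items)
        return cls(lambda: data)

    # Transformations: each returns a NEW ToyRDD and computes nothing yet.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    def union(self, other: "ToyRDD") -> "ToyRDD":
        return ToyRDD(lambda: itertools.chain(self._source(), other._source()))

    # Actions: run the recorded chain and return a plain Python value.
    def collect(self) -> list:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls = []  # records every element that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = ToyRDD.parallelize(range(1, 6))

# Two chained transformations: a new dataset is described, nothing runs.
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0 at this point

# An action forces the whole chain to execute.
result = even_squares.collect()
print(calls_before_action, result)  # prints: 0 [4, 16]
```

In this sketch numbers is never modified: each transformation wraps the previous recipe in a new object, and the work happens only when collect or count is called.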
