Apache Spark core

The RDD (Resilient Distributed Dataset) is the core data structure of the Apache Spark architecture. An RDD stores data distributed and partitioned across multiple processors and servers, so that operations on its partitions can be executed concurrently.
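The following is a minimal sketch of that idea, assuming a local Spark installation; the application name, master URL, partition count, and data are purely illustrative. A local collection is distributed across partitions, and the action (reduce) triggers concurrent execution of one task per partition.

import org.apache.spark.{SparkConf, SparkContext}

object RDDExample {
  def main(args: Array[String]): Unit = {
    // Illustrative configuration: run locally, using all available cores
    val conf = new SparkConf().setAppName("RDDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection across 4 partitions
    val values = sc.parallelize(1 to 1000, numSlices = 4)

    // Transformations such as map are lazy; the reduce action triggers
    // concurrent execution, one task per partition
    val sumOfSquares = values.map(x => x.toLong * x).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}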

DataFrames were added later to extend RDDs with SQL functionality. The original Apache Spark machine learning library, MLlib, relies on the lower-level RDD API. The more recent, DataFrame-based ML library allows data scientists to describe transformations and actions using SQL-style queries.
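The sketch below, in which the column names and data are made up, illustrates how the same aggregation can be expressed either through the DataFrame DSL or through plain SQL against a temporary view; this DataFrame abstraction is the layer the newer ML library builds on.

import org.apache.spark.sql.SparkSession

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame from a local sequence of (label, feature) pairs
    val df = Seq((1.0, 0.5), (0.0, 1.5), (1.0, 0.7)).toDF("label", "feature")

    // Express the aggregation through the DataFrame DSL ...
    df.filter($"label" === 1.0).agg(Map("feature" -> "avg")).show()

    // ... or through SQL against a temporary view
    df.createOrReplaceTempView("samples")
    spark.sql("SELECT AVG(feature) FROM samples WHERE label = 1.0").show()

    spark.stop()
  }
}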

Note

Deprecation of the RDD-based API for MLlib

The RDD-based classes and methods in MLlib entered maintenance mode in Spark 2.0 and are expected to be removed in Spark 3.0.

Why Spark?

The introduction of the Hadoop ecosystem more than 10 years ago, ...
