Chapter 6. Apache Spark Implementation

Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It currently has the largest open source community in big data, with more than 1,000 contributors representing more than 250 organizations. The main characteristics of Spark are as follows:


Speed

It is engineered from the bottom up for performance and can be very fast by exploiting in-memory computing and other optimizations.

Ease of use

It provides easy-to-use APIs for operating on large datasets including a collection of more than 100 operators for transforming data and data frame APIs for manipulating semi-structured data.

Unified engine

It is packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Spark Streaming is an extension of the core Spark API that makes it easy to build fault-tolerant processing of real-time data streams. In the next section, we discuss how to use Spark Streaming to implement our solution.

Overall Architecture

Spark Streaming currently uses a minibatch approach to streaming, although a true streaming implementation is under way (discussed shortly). Because our problem requires state, the only option for our implementation is to use the mapWithState operator, where state is a Resilient Distributed Dataset ...
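To make the minibatch-plus-state idea concrete, here is a minimal plain-Python sketch. It does not use Spark or its API: the names micro_batches, map_with_state, and running_count are hypothetical, batches are cut by record count rather than by a time interval, and state lives in an ordinary dictionary rather than in a distributed dataset. It only illustrates how per-key state can be carried from one batch to the next.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Optional, Tuple


def micro_batches(records: Iterable, batch_size: int) -> Iterator[List]:
    """Group a record stream into small batches.

    Assumption: batches are cut by record count to keep the sketch simple;
    a real minibatch engine would cut them by time interval.
    """
    batch: List = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def map_with_state(
    batches: Iterable[List[Tuple[Hashable, int]]],
    update: Callable[[Hashable, int, Optional[int]], Tuple[int, Tuple]],
) -> Iterator[List[Tuple]]:
    """Apply `update` to every (key, value) record, keeping per-key state
    alive across batches (hypothetical stand-in for a stateful operator)."""
    state: Dict[Hashable, int] = {}
    for batch in batches:
        output = []
        for key, value in batch:
            # state.get(key) is None the first time a key is seen
            new_state, result = update(key, value, state.get(key))
            state[key] = new_state
            output.append(result)
        yield output


def running_count(key: Hashable, value: int, state: Optional[int]) -> Tuple[int, Tuple]:
    """Example update function: keep a running total per key."""
    total = (state or 0) + value
    return total, (key, total)


events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
results = list(map_with_state(micro_batches(events, 2), running_count))
for batch_output in results:
    print(batch_output)
```

Each printed list corresponds to one batch; the totals for a given key keep growing across batches because the state dictionary outlives any single batch.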
