138 | Big Data Simplied
Let us now consider the competition. The primary competitor here is Storm. However, by most
standards, Spark overtakes Storm: Spark records a throughput upwards of 40 times higher.
Also, Storm guarantees exactly-once semantics only through an add-on called Trident, which
ends up slowing Storm down quite a bit.
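One common way to get exactly-once effects on top of replayable batches is to store the id of the last applied batch next to the state and to skip a batch that has already been applied. The sketch below illustrates that general idea only; `CountStore` and its `apply` method are hypothetical names, not Trident's API, and the sketch assumes that batches arrive in order and that a replayed batch carries the same contents as the original. The extra bookkeeping on every update hints at why such guarantees cost throughput.

```python
from typing import Dict


class CountStore:
    """Hypothetical state store that remembers the last batch id applied per key."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.last_batch: Dict[str, int] = {}

    def apply(self, batch_id: int, updates: Dict[str, int]) -> None:
        """Add each delta to its key unless this batch was already applied to it."""
        for key, delta in updates.items():
            # A replay after a failure carries the same batch id, so it is skipped
            # and the count is not incremented twice.
            if self.last_batch.get(key) == batch_id:
                continue
            self.counts[key] = self.counts.get(key, 0) + delta
            self.last_batch[key] = batch_id


store = CountStore()
store.apply(1, {"spark": 2, "storm": 1})
store.apply(1, {"spark": 2, "storm": 1})  # replay of batch 1: ignored
store.apply(2, {"spark": 1})
```

After the replay, `store.counts` still reflects each batch exactly once.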
But it is important to note that Spark Streaming differs from Storm in its implementation.
Storm is a true streaming framework that processes each item one by one as it arrives, whereas
Spark processes the incoming data as a series of small, deterministic batch jobs built on the
underlying concept of RDDs (see Figure 6.11). This type of streaming is referred to as micro-batching.
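The difference between item-by-item processing and micro-batching can be sketched in plain Python. This is only an illustration of the idea, not Spark's or Storm's API: the `Event` type, the function names, the one-second interval and the upper-casing "job" are all assumptions made for the example, and arrival times are assumed to be non-negative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def process_one_by_one(events: Iterable[Event]) -> List[str]:
    """Record-at-a-time style: handle each item individually, in arrival order."""
    return [payload.upper() for _, payload in events]


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-length time windows.

    An event with arrival time in [k * interval, (k + 1) * interval) goes
    into batch k, so the same input always yields the same batches.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for arrival, payload in events:
        buckets[int(arrival // interval)].append(payload)
    if not buckets:
        return []
    # Include empty batches too, since a batch job runs for every interval.
    return [buckets.get(k, []) for k in range(max(buckets) + 1)]


def process_batch(batch: List[str]) -> List[str]:
    """The small batch job applied to each micro-batch."""
    return [payload.upper() for payload in batch]


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (3.7, "d")]
batches = micro_batches(events, interval=1.0)
results = [process_batch(b) for b in batches]
```

Both approaches produce the same processed items here; what changes is the unit of work. In the micro-batch version an item waits until its interval closes, which trades some latency for the ability to run each interval as an ordinary, repeatable batch job.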
6.1.6 Spark Libraries: Machine Learning
Let us now look at one of Spark’s more complex libraries, MLlib. It is complex because of its
subject matter: machine learning is itself a complex discipline. It might not be the most used
library, but it is still very actively developed. The set of machine learning algorithms it exposes
is constantly growing, and a number of them are still experimental.
FIGURE 6.10 Spark streaming
[Figure labels: Spark streaming with sources Kafka, Flume, HDFS/S3 and Twitter; outputs HDFS, databases and report dashboard. MLlib (machine learning): train machine models with live data; user trained model. Spark SQL: process with data frames; interactively fast query with SQL.]
FIGURE 6.11 Spark streaming internal flow
[Figure labels: input data stream -> Spark streaming -> input data in batches -> Spark engine -> batches of cleansed and processed data -> data ready for report dashboard or intelligent analytics.]