138 | Big Data Simplified
Let us now consider the competition. The primary competitor here is Storm. However, by most
standards, Spark overtakes Storm. Spark records a throughput upwards of 40 times higher.
Also, Storm guarantees exactly-once semantics only through an add-on called Trident,
which ends up slowing Storm down quite a bit.
But it is important to note that Spark Streaming differs from Storm in its implementation.
Storm is a true streaming framework that processes each item one by one as it arrives, whereas
Spark processes the incoming data as a series of small, deterministic batch jobs through the
underlying concept of RDDs (Ref. Figure 6.11). This type of streaming is referred to as micro-batching.
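The difference between the two models can be illustrated with a minimal sketch in plain Python. It does not use the Spark or Storm APIs; the function names (process_per_item, micro_batches, process_micro_batched) are hypothetical and chosen only for illustration. It also assumes batches are cut by item count to keep the example small, whereas Spark Streaming cuts them by a time interval.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_per_item(stream: Iterable[T], handler: Callable[[T], R]) -> Iterator[R]:
    """Per-item model (Storm-like): handle each item as soon as it arrives."""
    for item in stream:
        yield handler(item)


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Cut an incoming stream into small batches.

    Assumption: batches are cut by item count for simplicity;
    Spark Streaming cuts them by a time interval instead.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_micro_batched(
    stream: Iterable[T], batch_size: int, batch_job: Callable[[List[T]], R]
) -> Iterator[R]:
    """Micro-batch model: run one small batch job per batch of items."""
    for batch in micro_batches(stream, batch_size):
        yield batch_job(batch)


events = [3, 1, 4, 1, 5, 9, 2]

# One result per item, produced as each item arrives.
per_item = list(process_per_item(events, lambda x: x * 2))

# One result per batch of up to three items.
per_batch = list(process_micro_batched(events, 3, sum))

print(per_item)   # [6, 2, 8, 2, 10, 18, 4]
print(per_batch)  # [8, 15, 2]
```

In the per-item model a result is available as soon as each item is handled, while in the micro-batch model a result appears only once a batch is complete. Each batch job in this sketch is an ordinary function of its input batch, so rerunning it on the same batch gives the same result.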
6.1.6 Spark Libraries: Machine Learning
Let us now look at one of Spark’s more complex libraries, MLlib. It is complex because its
subject matter, machine learning, is itself a complex discipline. It might not be
the most used library, but it is still very active. The set of machine learning algorithms it exposes
is constantly growing, and a number of them are still experimental.
FIGURE 6.10 Spark streaming
FIGURE 6.11 Spark streaming internal flow