Chapter 28. Real-Time Machine Learning

In this chapter, we explore how to build online classification and clustering algorithms. By online, we mean that these algorithms learn to produce an optimal classification or clustering result on the fly, updating themselves as new data arrives.
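To make "learning on the fly" concrete, here is a minimal sketch of sequential (online) k-means in plain Python. It is an illustration only, not the Spark implementation: the class name `OnlineKMeans` and its methods are hypothetical, and the sketch assumes a fixed number of clusters with centroids seeded by the caller.

```python
from typing import List, Sequence


class OnlineKMeans:
    """Sequential k-means: each arriving point nudges its nearest centroid.

    Hypothetical illustration; names and structure are assumptions,
    not an API from Spark or any other library.
    """

    def __init__(self, initial_centroids: Sequence[Sequence[float]]):
        self.centroids: List[List[float]] = [list(c) for c in initial_centroids]
        # Number of points assigned to each centroid so far.
        self.counts: List[int] = [0] * len(self.centroids)

    def predict(self, point: Sequence[float]) -> int:
        """Return the index of the centroid closest to the point."""
        def squared_distance(centroid: Sequence[float]) -> float:
            return sum((p - c) ** 2 for p, c in zip(point, centroid))

        return min(range(len(self.centroids)),
                   key=lambda i: squared_distance(self.centroids[i]))

    def learn_one(self, point: Sequence[float]) -> int:
        """Assign the point to a cluster and update that cluster's centroid.

        The centroid becomes the running mean of the points assigned to it.
        Because counts start at zero, the first point assigned to a cluster
        replaces the seed centroid entirely.
        """
        i = self.predict(point)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        self.centroids[i] = [c + rate * (p - c)
                             for c, p in zip(self.centroids[i], point)]
        return i
```

Each call to `learn_one` both clusters the new point and refines the model, so no pass over historical data is needed. A real streaming variant would typically also let old data decay so that the model can follow drift.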

Note

This chapter assumes that you have a basic understanding of machine learning algorithms, including concepts such as supervised and unsupervised learning, and classification versus clustering algorithms. If you want a quick brush-up of fundamental machine learning concepts, we recommend reading [Conway2012].

In the following sections, we explain how to train several machine learning algorithms in an online fashion, with data coming from a stream. Before that, let’s acknowledge that most industry implementations of machine learning models already have an online component: they train in a batch fashion (with data at rest) and combine that training with online inference. As examples arrive over a stream, the previously trained models are used to score (or predict on) the streaming data.
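The split between batch training and online inference can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Spark code: the functions `train_batch` and `score_stream` are hypothetical names, the model is a basic nearest-centroid classifier, and an ordinary Python iterator stands in for the stream.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Model = Dict[str, List[float]]


def train_batch(examples: Iterable[Tuple[Sequence[float], str]]) -> Model:
    """Fit on data at rest: compute one mean feature vector per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def score_stream(model: Model,
                 stream: Iterable[Sequence[float]]) -> Iterator[Tuple[Sequence[float], str]]:
    """Score examples one at a time as they arrive; the model is never updated."""
    for features in stream:
        label = min(
            model,
            key=lambda lbl: sum((f - c) ** 2 for f, c in zip(features, model[lbl])),
        )
        yield features, label
```

The model is frozen once `train_batch` returns, so `score_stream` only reads it. That separation is what distinguishes this arrangement from the online training covered in the rest of the chapter, where the model itself changes as data arrives.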

In this sense, most machine learning algorithms are already deployed in a streaming context, and Spark offers features for making this task easier, whether it’s the simple compatibility between Spark’s batch and streaming APIs (which we’ve addressed in prior chapters) or external projects that aim to make deployment simpler, such as MLeap.

The challenge in making this architecture work—batch ...
