Chapter 28. Real-Time Machine Learning

In this chapter, we explore how to build online classification and clustering algorithms. By online, we mean that these algorithms learn to produce an optimal classification and clustering result on the fly as new data is provided to them.
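To make "learning on the fly" concrete, here is a minimal sketch of one well-known online clustering scheme, sequential k-means, in which every arriving point moves its nearest centroid a little. It is an illustration only and not necessarily one of the algorithms this chapter covers; the class name `OnlineKMeans`, the method `learn_one`, and the sample points are all hypothetical.

```python
import math


class OnlineKMeans:
    """Sequential k-means: one centroid update per arriving point (illustrative)."""

    def __init__(self, initial_centroids):
        # Seed centroids are assumptions; they carry no weight once data arrives.
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def _nearest(self, point):
        # Index of the centroid closest to the point (Euclidean distance).
        distances = [math.dist(point, c) for c in self.centroids]
        return distances.index(min(distances))

    def learn_one(self, point):
        # Assign the point, then move that centroid toward it.
        # With rate 1/n the centroid is the running mean of its assigned points.
        k = self._nearest(point)
        self.counts[k] += 1
        rate = 1.0 / self.counts[k]
        self.centroids[k] = [c + rate * (x - c)
                             for c, x in zip(self.centroids[k], point)]
        return k


# Hypothetical stream of 2-D points, consumed one at a time.
clusterer = OnlineKMeans(initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
stream = [(1.0, 1.0), (9.0, 9.0), (2.0, 0.0), (11.0, 10.0)]
assignments = [clusterer.learn_one(p) for p in stream]

print(assignments)          # cluster index chosen for each point
print(clusterer.centroids)  # centroids after seeing the whole stream
```

The model never revisits old points: each update uses only the current centroid, its count, and the new point, which is what lets the algorithm keep up with an unbounded stream.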

Note

This chapter assumes that you have a basic understanding of machine learning algorithms, including concepts such as supervised and unsupervised learning, and classification versus clustering algorithms. If you want a quick brush-up of fundamental machine learning concepts, we recommend reading [Conway2012].

In the following sections, we explain how several machine learning algorithms can be trained in an online fashion, with data coming from a stream. Before that, let's acknowledge that most industry implementations of machine learning models already have an online component: they perform training in a batch fashion, with data at rest, and link it with online inference, in which examples are read from a stream of data and the previously trained models are used to score (or predict on) the streaming data.
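The batch-training, online-inference split can be sketched in a few lines. This is a library-free illustration that uses a nearest-centroid classifier as a stand-in for a real model; the function names `train_batch` and `score_stream`, the labels, and the data are hypothetical and do not come from Spark.

```python
import math
from collections import defaultdict


def train_batch(labeled_rows):
    """Batch step: compute one mean vector per label from data at rest."""
    sums, counts = {}, defaultdict(int)
    for features, label in labeled_rows:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def score_stream(model, stream):
    """Online step: the model is frozen; each arriving record is only scored."""
    for features in stream:
        label = min(model, key=lambda lbl: math.dist(features, model[lbl]))
        yield features, label


# Hypothetical historical data, available in full before training starts.
historical = [((1.0, 1.0), "low"), ((2.0, 2.0), "low"),
              ((8.0, 9.0), "high"), ((10.0, 9.0), "high")]
centroid_model = train_batch(historical)

# Hypothetical stream: records arrive one by one and are never stored.
incoming = iter([(0.5, 1.5), (9.5, 8.0), (6.0, 6.0)])
predictions = [label for _, label in score_stream(centroid_model, incoming)]
print(predictions)
```

Note that `score_stream` never modifies `centroid_model`. In this arrangement the model improves only when the batch step is rerun, in contrast to online training, where the model itself changes as stream data arrives.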

In this sense, most machine learning algorithms are already deployed in a streaming context, and Spark offers features for making this task easier, whether it's the simple compatibility between Spark's batch and streaming APIs (which we've addressed in prior chapters) or external projects that aim to make deployment simpler, such as MLeap.

The challenge in making this architecture work, batch ...


Stream Processing with Apache Spark by
