Chapter 12. Model Serving Patterns

Once they’ve been trained, ML models are used to generate predictions, or results, a process referred to as running inference or serving the model. The ultimate value of the model is in the results it generates, which should reflect the information in the training data as closely as possible without actually duplicating it. In other words, the ML model should generalize well and be as accurate, reliable, and stable as possible. In this chapter, we will look at some of the many patterns for serving models, and the infrastructure each requires.

The two primary ways to serve a model are as a batch process or as a real-time process. We’ll discuss both, along with pre- and postprocessing of the data, and more specialized applications such as serving at the edge or in a browser.

Batch Inference

After you train, evaluate, and tune an ML model, the model is deployed to production to generate predictions. In applications where a delay is acceptable, a model can be used to provide predictions in batches, which are stored and then consumed by the application at some later time.

In batch inference, your model is used offline, in a batch job, usually to score a large number of data points at once, in situations where predictions do not have to (or cannot) be generated in real time. In batch recommendations, for example, you might use only historical information about customer–item interactions to make the prediction, without any need for real-time information. In the retail industry, ...
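As a minimal sketch of this pattern, the following Python script loads a previously trained model, scores an entire batch of records in one pass, and writes the predictions somewhere a downstream system can look them up. The file paths, the feature column names, and the recommender.joblib model file are hypothetical placeholders, and the script assumes a scikit-learn-style model serialized with joblib; in practice a job like this would be triggered on a schedule by a workflow orchestrator or a cron entry.

import joblib
import pandas as pd

def run_batch_inference(model_path: str, input_path: str, output_path: str) -> None:
    # Load the model that was serialized at training time.
    model = joblib.load(model_path)

    # Read the full batch of data points to score. There is no
    # request/response cycle, so throughput matters more than latency.
    batch = pd.read_csv(input_path)

    # Generate predictions for every row in the batch at once.
    # The feature columns here are hypothetical placeholders.
    batch["prediction"] = model.predict(batch[["feature_1", "feature_2"]])

    # Persist the predictions so a downstream system (for example, a
    # recommendations lookup table) can consume them later.
    batch.to_csv(output_path, index=False)

if __name__ == "__main__":
    run_batch_inference("recommender.joblib", "customer_items.csv", "predictions.csv")

Because nothing waits on an individual prediction, a job like this can be scheduled during off-peak hours and scaled for throughput rather than latency, which is what makes batch inference attractive whenever a delay is acceptable.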
