Chapter 8. PyTorch in Production

Now that you’ve learned how to use PyTorch to classify images, text, and sound, the next step is to look at how to deploy PyTorch applications in production. In this chapter, we create applications that run inference on PyTorch models over HTTP and gRPC. We then package those applications into Docker containers and deploy them to a Kubernetes cluster running on Google Cloud.

In the second half, we look at TorchScript, a new technology introduced in PyTorch 1.0 that allows us to use just-in-time (JIT) tracing to produce optimized models that can be run from C++. We also take a brief look at compressing models with quantization. First up, let’s look at model serving.
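To give a flavor of JIT tracing before we get there, here is a minimal sketch: we run a pretrained model through torch.jit.trace with a dummy input and save the result, which can later be loaded from C++ with libtorch. The choice of ResNet-18, the input shape, and the filename are illustrative assumptions, not fixed by anything in this chapter.

import torch
import torchvision

# Tracing records the operations executed for one example input
# and produces a ScriptModule that no longer needs Python to run.
model = torchvision.models.resnet18(pretrained=True)
model.eval()  # tracing should happen in inference mode

example_input = torch.rand(1, 3, 224, 224)  # dummy batch of one image
traced = torch.jit.trace(model, example_input)

# The saved file can be loaded from C++ via torch::jit::load.
traced.save("resnet18_traced.pt")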

Model Serving

We’ve spent the last six chapters building models in PyTorch, but building a model is only part of building a deep learning application. After all, a model may have amazing accuracy (or whatever metric is relevant), but if it never makes any predictions, is it worth anything? What we want is an easy way to package our models so they can respond to requests (over the web or by other means, as we’ll see) and can be run in production with a minimum of effort.

Thankfully, Python allows us to get a web service up and running quickly with the Flask framework. In this section, we build a simple service that loads our ResNet-based cat or fish model, accepts requests that include an image URL, and returns a JSON response indicating whether the image contains a cat or a fish.
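As a preview of where this section is headed, here is a minimal sketch of such a service. It assumes a ResNet fine-tuned for two classes whose weights were saved earlier; the checkpoint filename (catfish_model.pt), the /predict endpoint, and the class ordering are hypothetical choices for illustration.

import io

import requests
import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import models, transforms

app = Flask(__name__)

# Hypothetical two-class checkpoint: a ResNet-50 with its final
# fully connected layer replaced, as in the earlier transfer-learning chapters.
CLASSES = ["cat", "fish"]
model = models.resnet50()
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("catfish_model.pt", map_location="cpu"))
model.eval()

# Standard ImageNet preprocessing, matching how the model was trained.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"image_url": "http://example.com/cat.jpg"}
    url = request.get_json()["image_url"]
    image = Image.open(io.BytesIO(requests.get(url).content)).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        prediction = model(batch).argmax(dim=1).item()
    return jsonify({"prediction": CLASSES[prediction]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

With the server running, a client could POST an image URL to /predict and receive a response such as {"prediction": "cat"}.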
