Building Machine Learning Pipelines by Catherine Nelson, Hannes Hapke


Chapter 4. Model Deployment with TensorFlow Serving

The deployment of your machine learning model is the last step before others can use it to make predictions. Unfortunately, the deployment of machine learning models falls into a grey zone in today's division of labor in the digital world. It isn't just a DevOps task, since it requires some knowledge of the model architecture and its hardware requirements. At the same time, deploying deep learning models is a bit outside the comfort zone of machine learning engineers and data scientists: they know their models inside out but tend to struggle with the DevOps side of deployment. In this and the following chapters, we want to bridge the gap between these two worlds and guide data scientists and DevOps engineers through the steps to deploy machine learning models.

Machine learning models can be deployed in three main ways. The most common way today is deploying the model to a model server. The client that requests a prediction submits the input data to the model server and in return receives a prediction. This requires that the client can connect to the model server. In this chapter, we focus on this type of model deployment.

There are situations where you don’t want to submit the input data to a model server, e.g., when the input data is sensitive, or there are privacy concerns. In this situation, you can deploy the machine learning model to the user’s browser. For example, if you want to determine if an image contains sensitive information, you could classify the sensitivity level of the image before it is uploaded to a cloud server.

Finally, there is a third type of model deployment: to edge devices. There are situations which don't allow you to connect to a model server to make predictions, e.g., remote sensors or IoT devices. The number of applications being deployed to edge devices is increasing, and it is now a valid option for model deployments. In the following chapter, we will describe the latter two methods.

In this chapter, we highlight TensorFlow’s Serving module and introduce its setup and usage. This is not the only way of deploying deep learning models; there are several other options available at the moment. At the time of writing this chapter, we feel that TensorFlow Serving offers the simplest server setup and the best performance.

Let’s start the chapter with how you shouldn’t set up a model server, before we deep dive into TensorFlow Serving.

A Simple Model Server

Most introductions to deploying machine learning models follow roughly the same script:

  • Create a web app with Python (Flask or Django)

  • Create an API endpoint in the web app

  • Load the model structure and its weights

  • Call the predict method on the loaded model

  • Return the prediction results as an HTTP response

Example 4-1 shows an example implementation of such an endpoint for our basic model.

Example 4-1. Example Setup of a Flask Endpoint to Infer Model Predictions
import json
from flask import Flask, request
from keras.models import load_model
from utils import preprocess 1

model = load_model('model.h5') 2
app = Flask(__name__)

@app.route('/classify', methods=['POST'])
def classify():
    review = request.form["review"]
    preprocessed_review = preprocess(review)
    prediction = model.predict_classes([preprocessed_review])[0] 3
    return json.dumps({"score": int(prediction)})
1. Preprocessing to convert characters to indices

2. Load your trained model

3. Perform the prediction and return it in the HTTP response

This setup is a quick and easy implementation, perfect for demonstration projects. We do not recommend using Example 4-1 to deploy machine learning models to production endpoints.

In the next section, let’s discuss why we don’t recommend deploying deep learning models with such a setup. These reasons form the benchmark for our proposed deployment solution.

Why it isn’t Recommended

While the Example 4-1 implementation can be sufficient for demonstration purposes, it has some significant drawbacks for scalable machine learning deployments.

Code Separation

In our demonstration example Example 4-1, we assumed that the trained model is deployed with the API code base and lives in the same code base. That means there is no separation between the API code and the machine learning model. This can be problematic when the data scientists want to update a model and such an update requires coordination with the API team. Such coordination also requires that the API and data science teams work in sync to avoid unnecessary delays in model deployments.

An intertwined API and data science code base also creates ambiguity around API ownership.

The missing code separation also requires that the model be loaded in the same programming language as the API code is written in. This mixing of backend and data science code can ultimately prevent your API team from upgrading your API backend.

In this chapter, we highlight how you can separate your models from your API code effectively and simplify your deployment workflows.

Lack of Model Version Control

Example 4-1 doesn’t provide any provision for different model versions. If you wanted to add a new version, you would have to create a new endpoint (or add some branching logic to the existing endpoint). This requires extra attention to keep all endpoints structurally the same, and it requires much boilerplate code.

The lack of model version control also requires the API and the data science team to coordinate which version is the default version and how to phase in new models.

Inefficient Model Inference

For any request to your prediction endpoint in the Flask setup shown in Example 4-1, a full round trip is performed. That means each request is preprocessed and inferred individually. The key reason why we argue that such a setup is only for demonstration purposes is that it is highly inefficient. During the training of your model, you probably use a batching technique which allows you to compute multiple samples at the same time and then apply the gradient change for your batch to your network’s weights. You can apply the same technique when you want the model to make predictions. A model server can gather all requests during a predefined timeframe or until the batch is full and ask the model for its predictions. This is an especially effective method when the inference runs on GPUs.

In this chapter, we introduce how you can easily set up such a batching behavior for your model server.

TensorFlow Serving

As you have seen throughout this book, TensorFlow comes with a fantastic ecosystem of extensions and tools. One of the earliest open-sourced extensions was TensorFlow Serving. It allows you to deploy any TensorFlow graph, and you can make predictions from the graph through standardized endpoints. As we discuss in a moment, TensorFlow Serving handles the model and version management for you, lets you serve models based on policies, and lets you load your models from various sources. At the same time, it is focused on high-performance throughput for low-latency predictions. TensorFlow Serving is used internally at Google and has been adopted by a good number of corporations and startups 1.

TensorFlow Serving Architecture Overview

TensorFlow Serving provides you with the functionality to load models from a given source (e.g., AWS S3 buckets) and notifies the loader if the source has changed. As Figure 4-1 shows, everything behind the scenes of TensorFlow Serving is controlled by a model manager, which manages when to update the models and which model is used for the inferences. The rules for the inference determination are set by the policy, which is managed by the model manager. Depending on your configuration, you can, for example, load one model at a time and have the model update automatically once the source module detects a newer version.

Figure 4-1. Overview of the TensorFlow Serving Architecture

Before the deep dive into the server configuration to show you how you can set up these policies, let’s talk about how we need to export the models so that they can be loaded into TensorFlow Serving.

Exporting Models for TensorFlow Serving

Before we dive into the TensorFlow Serving configurations, let’s discuss how you can export your machine learning models so that they can be used by TensorFlow Serving.

Depending on your type of TensorFlow model, the export steps are slightly different. The exported models have the same file structure, as we will see in a moment.

For Keras models, you can use

saved_model_path = tf.saved_model.save(model, "./saved_models")

while for TensorFlow Estimator models, you first need to declare a receiver function

def serving_input_receiver_fn():
    sentence = tf.placeholder(dtype=tf.string, shape=[None, 1], name='sentence')
    fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
        features={'sentence': sentence})
    return fn

and then export the Estimator model with the export_saved_model method of the estimator.

estimator = tf.estimator.Estimator(model_fn, 'model', params={})
estimator.export_saved_model('saved_models/', serving_input_receiver_fn)

Both export methods produce output similar to the following.

...
WARNING:tensorflow:Model was compiled with an optimizer, but the optimizer is not from `tf.train` (e.g. `tf.train.AdagradOptimizer`). Only the serving graph was exported. The train and evaluate graphs were not added to the SavedModel.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:205: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: saved_models/1555875926/saved_model.pb
Model exported to:  b'saved_models/1555875926'

In the specified output directory (in our example, saved_models/), we find the exported model. For every exported model, TensorFlow creates a directory with the timestamp of the export as its folder name.

$ tree saved_models/
saved_models/
└── 1555875926
    ├── assets
    │   └── saved_model.json
    ├── saved_model.pb
    └── variables
        ├── checkpoint
        ├── variables.data-00000-of-00001
        └── variables.index

3 directories, 5 files

The folder contains the following files and subdirectories:

saved_model.pb

The binary protobuf file contains the exported model graph structure as a MetaGraphDef object.

variables

The folder contains the binary files with the exported variable values and checkpoints corresponding to the exported model graph.

assets

This folder is created when additional files are needed to load the exported model. Such additional files can be vocabularies, which we have seen in the data preprocessing chapter.

Model Versioning

When data scientists and machine learning engineers export models, the question of model versioning often comes up. In software engineering, developers share the common understanding that minor changes and bug fixes lead to an increase of the minor version number, while large code changes, especially breaking changes, require an update of the major version number.

In machine learning, we have at least two more degrees of freedom: while your model code describes the model’s architecture and the supporting functionality (e.g., preprocessing), you can encounter different model performance with a single code change.

Your model can produce very different prediction results just by tweaking model hyperparameters like the learning rate, dropout rate and so on.

At the same time, no changes to the model architecture and parameters can produce a model with different performance characteristics.

No clear consensus has been established around the versioning of models. In our experience, the following approach has worked very well:

  • Changes in the model architecture (e.g., new layers) require a new model name. The name should be descriptive of the model’s architecture.

  • Changes to the model’s hyperparameters require a new model version number.

  • Training the same model with a more extensive data set also requires a new model version number, since the model performance will differ from previous model exports.

Instead of increasing the model version number, it has been beneficial to use the Unix epoch timestamp of the export time as the version number. The newer model versions will then always have a higher model version number, and the machine learning engineer doesn’t need to worry about which version to increase.
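As a minimal sketch of this convention (assuming a hypothetical Keras model and export directory), you could derive the version folder name from the current epoch timestamp at export time:

import time
import tensorflow as tf

# Hypothetical model and base directory; adjust to your setup
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
export_base = "saved_models/my_model"

# Use the Unix epoch timestamp as the version folder name,
# e.g., saved_models/my_model/1556583584
export_path = f"{export_base}/{int(time.time())}"
tf.saved_model.save(model, export_path)

TensorFlow Serving can then pick the folder with the highest timestamp as the latest version.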

Model Signatures

Models are exported together with a signature which specifies the graph inputs and outputs. The inputs to your model graph are determined by your Keras InputLayer definitions or, in the case of a TensorFlow Estimator, by your serving_input_receiver_fn function. The model outputs are determined by the model graph.

For example, a classification model takes an input sentence and outputs the predicted classes together with the corresponding scores.

signature_def: {
  key  : "classification_signature"
  value: {
    inputs: {
      key  : "inputs"
      value: {
        name: "sentence:0"
        dtype: DT_STRING
        tensor_shape: ...
      }
    }
    outputs: {
      key  : "classes"
      value: {
        name: "index_to_string:0"
        dtype: DT_STRING
        tensor_shape: ...
      }
    }
    outputs: {
      key  : "scores"
      value: {
        name: "TopKV2:0"
        dtype: DT_FLOAT
        tensor_shape: ...
      }
    }
    method_name: "tensorflow/serving/classify"
  }
}

TensorFlow Serving provides high-level APIs for three common use cases:

  • Classification

  • Prediction

  • Regression

Each inference use case corresponds to a different model signature.

For example, linear regression models based on tf.estimator.LinearRegressor are exported with a regression signature like the following.

signature_def: {
  key  : "regression_signature"
  value: {
    inputs: {
      key  : "inputs"
      value: {
        name: "input_tensor_0"
        dtype: ...
        tensor_shape: ...
      }
    }
    outputs: {
      key  : "outputs"
      value: {
        name: "y_outputs_0"
        dtype: DT_FLOAT
        tensor_shape: ...
      }
    }
    method_name: "tensorflow/serving/regress"
  }
}

At the same time, prediction models based on TensorFlow’s tf.estimator.LinearEstimator are exported with a prediction signature.

signature_def: {
  key  : "prediction_signature"
  value: {
    inputs: {
      key  : "inputs"
      value: {
        name: "sentence:0"
        dtype: ...
        tensor_shape: ...
      }
    }
    outputs: {
      key  : "scores"
      value: {
        name: "y:0"
        dtype: ...
        tensor_shape: ...
      }
    }
    method_name: "tensorflow/serving/predict"
  }
}

Inspecting Exported Models

After all the talk about exporting your model and the corresponding model signatures, let’s discuss how you can inspect the exported models before deploying them with TensorFlow Serving.

When you install the TensorFlow Serving Python API with

$ pip install tensorflow-serving-api

you have access to a handy command-line tool called the SavedModel CLI. saved_model_cli lets you

  • Inspect the signatures of exported models: This is especially useful when you didn’t export the model yourself and you want to learn about the inputs and outputs of the model graph.

  • Test the exported models: The CLI tool lets you run inference on the model without deploying it with TensorFlow Serving. This is extremely useful when you want to test your model input data.

Inspecting the Model

saved_model_cli helps you understand the model dependencies without inspecting the original graph code.

If you don’t know the available tag-sets, you can inspect the model with

$ saved_model_cli show --dir saved_models/
The given SavedModel contains the following tag-sets:
serve

If your model contains different graphs for different environments, e.g., a graph for a CPU or GPU inference, you will see multiple tags. If your model contains multiple tags, you need to specify a tag to inspect the details of the model.

Once you know the tag_set you want to inspect, add it as an argument, and saved_model_cli will provide you the available model signatures. Our example model has only one signature which is called serving_default.

$ saved_model_cli show --dir saved_models/ --tag_set serve
The given SavedModel `MetaGraphDef` contains `SignatureDefs` with the
following keys:
SignatureDef key: "serving_default"

With the tag_set and signature_def information, you can now inspect the model’s inputs and outputs. To obtain the detailed information, add the signature_def to the CLI arguments.

$ saved_model_cli show --dir saved_models/ \
        --tag_set serve --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['sentence'] tensor_info:
      dtype: DT_STRING
      shape: (-1, 1)
      name: x:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['rating'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: rating:0
Method name is: tensorflow/serving/predict

If you want to see all signatures regardless of the tag_set and signature_def, you can use the --all argument

$ saved_model_cli show --dir saved_models/ --all
...

Testing the Model

saved_model_cli also lets you test the exported model with sample input data.

You have three different ways to submit the sample input data for the model test inference.

--inputs

The argument points at a NumPy file containing the input data formatted as a NumPy ndarray.

--input_exprs

The argument allows you to define a Python expression to specify the input data. You can use NumPy functionality in your expressions.

--input_examples

The argument expects the input data formatted as tf.Example data structures (see the chapter on data validation, where we introduced the tf.Example data structure).

For testing the model, you can specify exactly one of the input arguments. Furthermore, saved_model_cli provides three optional arguments:

--outdir

saved_model_cli will write any graph output to stdout. If you would rather write the output to a file, you can specify the target directory with --outdir.

--overwrite

If you opt for writing the output to a file, you can specify with --overwrite that the files can be overwritten.

--tf_debug

If you further want to inspect the model, you can step through the model graph with the TensorFlow Debugger (tfdbg).

Here is an example inference with our demonstration model:

$ saved_model_cli run --dir saved_models/ --tag_set serve \
--signature_def x1_x2_to_y --input_examples 'examples=[{"state_xf":"CA"}]'
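Alternatively, you can pass the input as a Python expression. The following is a hypothetical sketch assuming the serving_default signature shown earlier, which expects a string tensor named sentence with shape (-1, 1):

$ saved_model_cli run --dir saved_models/ --tag_set serve \
    --signature_def serving_default \
    --input_exprs 'sentence=[["classify my text"]]'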

After all the introduction of how to export models and how to inspect them, let’s dive into the TensorFlow Serving installation, setup and operation.

Setting up TensorFlow Serving

There are two easy ways to get TensorFlow Serving installed on your serving instances. You can either run TensorFlow Serving on Docker or, if you run an Ubuntu OS on your serving instances, you can install the Ubuntu package.

Docker Installation

The easiest way of installing TensorFlow Serving is to download the pre-built Docker image. As you have seen in Chapter 2, you can obtain the image by running

$ docker pull tensorflow/serving
Note

If you haven’t installed or used Docker before, check out our brief introduction in Appendix [Link to Come].

If you are running the Docker container on an instance with GPUs available, you will need to download the latest build with GPU support.

$ docker pull tensorflow/serving:latest-gpu

The Docker image with GPU support requires NVIDIA’s Docker support for GPUs. The installation steps can be found on the company’s website 2.

Native Ubuntu Installation

If you want to run TensorFlow Serving without the overhead of running Docker, you can install Linux binary packages available for Ubuntu distributions.

The installation steps are similar to other non-standard Ubuntu packages. First, you need to add a new package source to the distribution’s source list or add a new list file to the sources.list.d directory by executing

$ echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt \
  stable tensorflow-model-server tensorflow-model-server-universal" \
  | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list

in your Linux terminal. Before updating your package registry, you should add the packages’ public key to your distribution’s key chain.

$ curl https://storage.googleapis.com/tensorflow-serving-apt/\
tensorflow-serving.release.pub.gpg | sudo apt-key add -

After updating your package registry, you can install TensorFlow Serving on your Ubuntu operating system.

$ apt-get update
$ apt-get install tensorflow-model-server
Warning

Google provides two Ubuntu packages for TensorFlow Serving! The previously referenced tensorflow-model-server package is the preferred package, and it comes with specific CPU optimizations pre-compiled (e.g., AVX instructions).

At the time of writing this chapter, a second package with the name tensorflow-model-server-universal is also provided. It doesn’t contain the pre-compiled optimizations and can, therefore, be run on older hardware (e.g., CPUs without the AVX instruction set).

Building TensorFlow Serving from Source

Should you find yourself unable to run TensorFlow Serving from a pre-built Docker image or via the Ubuntu packages, for example, if you are running a different Linux distribution, you can build TensorFlow Serving from source.

At the moment, you can only build TensorFlow Serving for Linux operating systems and the build tool bazel is required. You can find detailed instructions in the TensorFlow Serving documentation 3.

If you build TensorFlow Serving from scratch, it is highly recommended to compile the Serving version for the specific TensorFlow version of your models and the available hardware of your serving instances.

Configure a TensorFlow Server

Out of the box, TensorFlow Serving can run in two different modes. You can specify a model, and TensorFlow Serving will always provide the latest model. Alternatively, you can specify a configuration file with all models and versions to be loaded, and TensorFlow Serving will load all named models.

Single Model Configuration

If you want to run TensorFlow Serving with a single model and switch to newer model versions when they become available, the single model configuration is preferred.

If you run TensorFlow Serving in a Docker environment, you can run the tensorflow/serving image with the following command

$ docker run -p 8500:8500 \ 1
             -p 8501:8501 \
             --mount type=bind,source=/tmp/models,target=/models/my_model \ 2
             -e MODEL_NAME=my_model \ 3
             -t tensorflow/serving 4
1. Specify the default ports

2. Create a bind mount to load the models

3. Specify your model

4. Specify the Docker image

By default, TensorFlow Serving is configured to create both a REST and a gRPC endpoint. By specifying both ports, 8500 and 8501, we expose the REST and gRPC capabilities. The docker run command creates a mount between a folder on the host (source) and the container (target) filesystem. In Chapter 2, we discussed how to pass environment variables to the Docker container. To run the server in a single model configuration, you need to specify the model name MODEL_NAME.

If you want to run the Docker image pre-built for GPUs, you need to swap the image name for the latest GPU build with

$ docker run ...
             -t tensorflow/serving:latest-gpu

If you have decided to run TensorFlow Serving without the Docker container, you can run it with the command

$ tensorflow_model_server --port=8500 \
                          --rest_api_port=8501 \
                          --model_name=my_model \
                          --model_base_path=/models/my_model

In both scenarios, you should see output on your terminal which is similar to the following

2019-04-26 03:51:20.304826: I tensorflow_serving/model_servers/server.cc:82]
  Building single TensorFlow model file config:
  model_name: my_model model_base_path: /models/my_model
2019-04-26 03:51:20.307396: I tensorflow_serving/model_servers/server_core.cc:461]
  Adding/updating models.
2019-04-26 03:51:20.307473: I tensorflow_serving/model_servers/server_core.cc:558]
  (Re-)adding model: my_model
...
2019-04-26 03:51:34.507436: I tensorflow_serving/core/loader_harness.cc:86]
  Successfully loaded servable version {name: my_model version: 1556250435}
2019-04-26 03:51:34.516601: I tensorflow_serving/model_servers/server.cc:313]
  Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 237] RAW: Entering the event loop ...
2019-04-26 03:51:34.520287: I tensorflow_serving/model_servers/server.cc:333]
  Exporting HTTP/REST API at:localhost:8501 ...

From the server output, you can see that the server loaded our model my_model successfully and that it created two endpoints: one REST and one gRPC endpoint.

TensorFlow Serving makes the deployment of machine learning models extremely easy. One great advantage of serving models that way is the “hot swap” capability. If a new model is uploaded, the server’s model manager will detect the new version, unload the existing model and load the newer model for inferencing.

Let’s say you update the model and export the new model version to the mounted folder on the host machine (if you are running the Docker setup); no configuration change is required. The model manager will detect the newer model and reload the endpoints. It will notify you about the unloading of the older model and the loading of the newer model. In your terminal, you should find messages like

2019-04-30 00:21:56.486988: I tensorflow_serving/core/basic_manager.cc:739]
  Successfully reserved resources to load servable {name: movie version: 1556583584}
2019-04-30 00:21:56.487043: I tensorflow_serving/core/loader_harness.cc:66]
  Approving load for servable version {name: movie version: 1556583584}
2019-04-30 00:21:56.487071: I tensorflow_serving/core/loader_harness.cc:74]
  Loading servable version {name: movie version: 1556583584}
...
2019-04-30 00:22:08.839375: I tensorflow_serving/core/loader_harness.cc:119]
  Unloading servable version {name: movie version: 1556583236}
2019-04-30 00:22:10.292695: I ./tensorflow_serving/core/simple_loader.h:294]
  Calling MallocExtension_ReleaseToSystem() after servable unload with 1262338988
2019-04-30 00:22:10.292771: I tensorflow_serving/core/loader_harness.cc:127]
  Done unloading servable version {name: movie version: 1556583236}

TensorFlow Serving will load the model with the highest version number. If you use the export methods shown earlier in this chapter, all models will be exported in folders with the epoch timestamp as the folder name. Therefore, newer models will have a higher version number than older models.

Multi Model Configuration

You can configure TensorFlow Serving to load multiple models at the same time. To do that you need to create a configuration file to specify the models.

model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model/'
  }
  config {
    name: 'another_model'
    base_path: '/models/another_model/'
  }
}

The configuration file contains one or more config dictionaries, all listed below a model_config_list key.

In your Docker configuration you can mount the configuration file and load the model server with the configuration file instead of a single model.

$ docker run -p 8500:8500 \
             -p 8501:8501 \
             --mount type=bind,source=/tmp/models,target=/models/my_model \
             --mount type=bind,source=/tmp/model_config,target=/models/model_config \ 1
             -e MODEL_NAME=my_model \
             -t tensorflow/serving \
             --model_config_file=/models/model_config 2
1. Mount the configuration file

2. Specify the model configuration file

If you run TensorFlow Serving outside of a Docker container, you can point the model server to the configuration file with the additional argument model_config_file, and the configuration will be loaded from the file

$ tensorflow_model_server --port=8500 \
                          --rest_api_port=8501 \
                          --model_config_file=/models/model_config

Configure Specific Model Versions

There are situations when you want to load not just the latest model version, but either all or specific model versions. TensorFlow Serving, by default, always loads the latest model version. If you want to load all available model versions, you can extend the model configuration file with

  ...
  config {
    name: 'another_model'
    base_path: '/models/another_model/'
    model_version_policy: {all: {}}
  }
  ...

If you want to specify specific model versions, you can define them as well.

  ...
  config {
    name: 'another_model'
    base_path: '/models/another_model/'
    model_version_policy {
      specific {
        versions: 1556250435
        versions: 1556251435
      }
    }
  }
  ...

You can even give the model versions labels. The labels can be extremely handy later when you want to make predictions from the models.

  ...
  model_version_policy {
    specific {
      versions: 1556250435
      versions: 1556251435
    }
  }
  version_labels {
    key: 'stable'
    value: 1556250435
  }
  version_labels {
    key: 'testing'
    value: 1556251435
  }
  ...

REST vs gRPC

In the configuration section, we discussed how TensorFlow Serving allows two different API types: REST and gRPC. Both protocols have their advantages and disadvantages, and we would like to take a moment to introduce both before we dive into how you can communicate with these endpoints.

Representational State Transfer

Representational State Transfer, or REST for short, is a communication “protocol” used by today’s web services. It isn’t a formal protocol, but rather a communication style which defines how clients communicate with web services. REST clients communicate with the server using standard HTTP methods like GET, POST, DELETE, etc. The payloads of the requests are often encoded as XML or JSON data formats.

Google Remote Procedure Calls

Google Remote Procedure Calls, or gRPC for short, is a remote procedure call protocol developed by Google. While gRPC supports different data formats, the standard data format used with gRPC is Protobuf, which we have used throughout this book. gRPC provides low-latency communication and smaller payloads if protocol buffers are used. gRPC was designed with APIs in mind, and its error handling is geared toward APIs. The downside is that the payloads are in a binary format, which can make a quick inspection difficult.

Which Protocol to Use

At first glance, it looks very convenient to communicate with the model server over REST. The endpoints are easy to call, the payloads can be easily inspected, and the endpoints can be tested with curl requests or browser tools.
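For instance, a quick test with curl could look like the following (a hypothetical request, assuming a model named my_model is served locally on the default REST port):

$ curl -X POST http://localhost:8501/v1/models/my_model:predict \
    -d '{"instances": [{"sentence": "classify my text"}]}'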

While gRPC APIs have a higher barrier to entry initially, they can lead to significant performance improvements depending on the data structures required for the model inference. If your model experiences many requests, the reduced payload size from the Protobuf data format can be useful.

Internally, TensorFlow Serving converts JSON data structures submitted via REST to tf.Example data structures and this can lead to slower performance. Therefore, you might see better performance with gRPC requests if the conversion requires many type conversions (e.g. if you submit a large array with Float values).

Making predictions from the Model Server

Until now, we have entirely focused on the model server setup. In this section, we want to demonstrate how a client, e.g., a web app, can interact with the model server. All code examples concerning REST or gRPC requests are executed on the client side.

Getting model predictions via REST

To call the model server over REST, you’ll need a Python library to facilitate the communication for you. The standard library these days is requests. Install the library with

$ pip install requests

The example below shows a sample POST request.

>>> import json
>>> from requests import Session

>>> http = Session() 1
>>> url = 'http://some-domain.abc'
>>> payload = json.dumps({"key_1": "value_1"})

>>> r = http.request('post', url, data=payload) 2
>>> r.json() 3
{'data': ...}
1. Set up a session with a connection pool

2. Submit the request

3. View the HTTP response

URL Structure

The URL for your HTTP request to the model server contains information about which model and which version you would like to infer.

http://{HOST}:{PORT}/v1/models/{MODEL_NAME}
HOST

The host is the IP address or domain name of your model server. If you run your model server on the same machine where you run your client code, you can set the host to localhost.

PORT

You’ll need to specify the port in your request URL. The standard port for the REST API is 8501. If this conflicts with other services in your service ecosystem, you can change the port in your server arguments during the startup of the server.

MODEL_NAME

The model name needs to match the name of your model, specified either in your model configuration or when you started up the model server.

http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/${MODEL_VERSION}]
MODEL_VERSION

If you want to make predictions from a specific model version, you’ll need to extend the URL with the version identifier. Earlier we talked about version labels; these become very handy for specifying an exact version.

If you want to submit a data example to a classification or regression model, you’ll need to extend the URL with the classify or regress argument.

http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/${MODEL_VERSION}]:(classify|regress)
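For example, assuming a hypothetical model named my_model served locally on the default REST port, the resolved URLs could look like this:

http://localhost:8501/v1/models/my_model:predict
http://localhost:8501/v1/models/my_model/versions/1556583584:predict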

Payloads

With the URLs in place, let’s discuss the request payloads. TensorFlow Serving expects the input data as a JSON data structure shown in the following example.

{
  "signature_name": <string>,
  "instances": <value>
}

The signature_name is not required. If it isn’t specified, the model server will infer the model graph with the default serving signature.

The input data is expected either as a list of objects or as a list of input values. If you want to submit multiple data samples, you can submit them as a list under the instances key.

{
  "signature_name": <string>,
  "inputs": <value>
}

If you want to submit one data example for the inference, you can use the inputs key and list all input values as a list. One of the two keys, instances or inputs, has to be present, but never both at the same time.
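A short sketch of building both payload variants with Python (the key sentence and the sample texts are hypothetical and depend on your model’s expected inputs):

import json

# Hypothetical request with two data samples under the "instances" key
batch_payload = json.dumps({
    "instances": [
        {"sentence": "The product worked great."},
        {"sentence": "The product broke after one day."}
    ]
})

# Hypothetical request with a single sample under the "inputs" key
single_payload = json.dumps({
    "inputs": {"sentence": "The product worked great."}
})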

Example

With the following example, you can request a model inference. In our example, we only submit one data example for the inference, but your list could easily contain more examples.

>>> import json
>>> import requests

>>> def rest_request(text):
>>>     url = 'http://localhost:8501/v1/models/movie:predict' 1
>>>     payload = json.dumps({"instances": [text]}) 2
>>>     response = requests.post(url, data=payload)
>>>     return response

>>> rs_rest = rest_request(text="classify my text")
>>> rs_rest.json()
1. Exchange localhost for an IP address if the server is not running on the same machine

2. Add more examples to the instances list if you want to infer more samples

Getting model predictions via gRPC

If you want to infer the model over gRPC, the steps are slightly different from the REST API requests.

First, you establish a gRPC channel. The channel provides the connection to the gRPC server at a given host address and over a given port. If you require a secure connection, you need to establish a secure channel at this point. Once the channel is established, you’ll create a stub. A stub is a local object which replicates the available methods from the server.

import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf

def create_grpc_stub(host, port='8500'):
    hostport = f'{host}:{port}'
    channel = grpc.insecure_channel(hostport)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    return stub
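If your setup requires a secure connection, the sketch below shows one way to create a secure channel instead (a minimal example, assuming you have the server's TLS certificate available as a local PEM file; the file path is hypothetical):

import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

def create_secure_grpc_stub(host, port='8500', cert_path='server.pem'):
    # Read the server's TLS certificate and build channel credentials
    with open(cert_path, 'rb') as f:
        credentials = grpc.ssl_channel_credentials(f.read())
    channel = grpc.secure_channel(f'{host}:{port}', credentials)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    return stub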

Once the gRPC stub is created, we can set the model and the signature to access predictions from the correct model and submit our data for the inference.

def grpc_request(stub, data_sample, model_name='my_model', signature_name='classification'):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name

    request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(data_sample, shape=[1,1])) 1
    result_future = stub.Predict.future(request, 10) 2
    return result_future
1. inputs is the name of the input tensor of our neural network

2. 10 is the maximum time in seconds before the request times out

With the two functions now available, we can infer our example data sets with the following two function calls

stub = create_grpc_stub(host, port='8500')
rs_grpc = grpc_request(stub, data)

Getting Predictions from Classification and Regression Models

If you are interested in making predictions from classification and regression models, you can do that with the gRPC API.

If you would like to get predictions from a classification model, you need to swap out the following lines

from tensorflow_serving.apis import predict_pb2
...
request = predict_pb2.PredictRequest()

with

from tensorflow_serving.apis import classification_pb2
...
request = classification_pb2.ClassificationRequest()

If you want to get predictions from a regression model, you can use the following imports

from tensorflow_serving.apis import regression_pb2
...
regression_pb2.RegressionRequest()
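As a minimal sketch of a complete classification request (the feature key sentence, the model name, and the signature name are hypothetical and depend on how your model was exported), the request could be built like this:

import tensorflow as tf
from tensorflow_serving.apis import classification_pb2

def grpc_classification_request(stub, text, model_name='my_model',
                                signature_name='classification'):
    request = classification_pb2.ClassificationRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    # Classification requests expect tf.Example records as input;
    # 'sentence' is a hypothetical feature key
    example = tf.train.Example(features=tf.train.Features(feature={
        'sentence': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')]))
    }))
    request.input.example_list.examples.extend([example])
    return stub.Classify(request, 10)

The request is then submitted through the Classify method of the same prediction service stub we created earlier.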

Payloads

The gRPC API uses protocol buffers as the data structure for the API requests. By using protocol buffer payloads, the API requests are compressed and therefore use less bandwidth. Also, depending on the model input data structure, you might experience faster inference than with the REST endpoints. The performance difference is explained by the fact that submitted JSON data is converted to a tf.Example data structure. This conversion can slow down the model server inference, and you might therefore encounter slower inference performance than in the gRPC API case.

Your data submitted to the gRPC endpoints needs to be converted to the protocol buffer data structure. TensorFlow provides a handy utility function to perform the conversion, called tf.make_tensor_proto. make_tensor_proto accepts various data formats, including scalars, lists, NumPy scalars, and NumPy arrays. The function will then convert the given Python or NumPy data structures to the protocol buffer format for the inference.
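For example, a small sketch of the conversion (the sample array shown here is hypothetical):

import numpy as np
import tensorflow as tf

# A hypothetical batch of one sample with three float features
sample = np.array([[0.1, 0.2, 0.3]], dtype=np.float32)

# Convert the NumPy array into a TensorProto for the gRPC request
tensor_proto = tf.make_tensor_proto(sample, shape=sample.shape)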

Model A/B Testing with TensorFlow Serving

A/B testing is an excellent methodology for testing different models in real-life situations, or for phasing in newer models by exposing them to a small number of users before directing all model inferences to the newer model.

We discussed earlier that you could configure TensorFlow Serving to load multiple model versions and then specify the model version in your REST request URL or gRPC specifications.

TensorFlow Serving doesn’t support A/B Testing from the server-side, but with a little tweak to our request URL, we can support random A/B testing from the client-side.

from random import random 1

def get_rest_url(model_name, host='localhost', port='8501',
                 verb='predict', version=None):
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version:
        url += f"/versions/{version}"
    url += f":{verb}"
    return url

...

# submit 10% of all requests from this client to version 1
# 90% of the requests should go to the default model
threshold = 0.1
version = 1 if random() < threshold else None 2
url = get_rest_url(model_name='complaints_classification', version=version)
1. The random library will help us pick a model

2. If version == None, TensorFlow Serving will infer with the default version

As you can see, randomly changing the request URL for our model inference (in our REST API example) can provide you with some basic A/B testing functionality. If you would like to extend these capabilities by performing the random routing of the model inference on the server side, we highly recommend routing tools like Istio 4 for that purpose. Originally designed for web traffic, Istio can be used to route traffic to specific models. You can phase in models, perform A/B tests, or create policies for data routed to specific models.

When you perform A/B tests with your models, it is often useful to request information about the model from the model server. In the following section, we will explain how you can request the metadata information from TensorFlow Serving.

Requesting Model Meta Data from the Model Server

At the beginning of the book, we laid out the model life cycle and how we want to automate the machine learning life cycle. A critical component of the continuous life cycle is generating accuracy or general performance feedback about your model versions. We will deep dive into how to generate these feedback loops in a later chapter, but for now, imagine that your model classifies some data, e.g., the sentiment of a text, and then asks the user to rate the prediction. The information about whether a model predicted something correctly or incorrectly is invaluable for improving future model versions, but it is only useful if we know which model version performed the prediction.

The metadata provided by the model server will contain the information to annotate your feedback loops.

REST Requests for Model Meta Data

Requesting model meta information is straightforward with TensorFlow Serving, which provides an endpoint for model meta information.

http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]/metadata

Similar to the REST API inference requests we discussed earlier, you have the option to specify the model version in the request URL, or if you don’t specify it, the model server will provide the information about the default model.

import requests

def metadata_rest_request(model_name, host='localhost',
                          port='8501', version=None):
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version:
        url += f"/versions/{version}"
    url += "/metadata"

    response = requests.get(url)
    return response

With one REST request, the model server will return the model specifications as a model_spec dictionary and the model signature definitions as a metadata dictionary.

{
	"model_spec": {
		"name": "complaints_classification",
		"signature_name": "",
		"version": "1556583584"
	},
	"metadata": {
		"signature_def": {
			"signature_def": {
				"classification": {
					"inputs": {
						"inputs": {
							"dtype": "DT_STRING",
							"tensor_shape": {
                ...

You can then attach the model meta information to the information you want to store about your model performance, which can be analyzed at a later point in time.
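A minimal sketch of such a feedback record, reusing the metadata_rest_request function from above (the prediction score and the user rating shown here are hypothetical placeholders):

meta = metadata_rest_request("complaints_classification").json()

feedback_record = {
    "model_name": meta["model_spec"]["name"],
    "model_version": meta["model_spec"]["version"],
    "prediction": 0.91,        # hypothetical model output
    "user_rating": "correct",  # hypothetical user feedback
}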

gRPC Requests for Model Meta Data

Requesting model meta information over gRPC is almost as easy as in the REST API case. In the gRPC case, you file a GetModelMetadataRequest, add the model name to the specifications, and submit the request via the GetModelMetadata method of the stub.

from tensorflow_serving.apis import get_model_metadata_pb2

def get_model_version(model_name, stub):
    request = get_model_metadata_pb2.GetModelMetadataRequest()
    request.model_spec.name = model_name
    request.metadata_field.append("signature_def")
    response = stub.GetModelMetadata(request, 5)
    return response.model_spec

>>> model_name = 'complaints_classification'
>>> stub = create_grpc_stub('localhost')
>>> get_model_version(model_name, stub)

name: "complaints_classification"
version {
  value: 1556583584
}

The gRPC response contains a ModelSpec object which contains the version number of the loaded model.

More interesting is the use case of obtaining the model signature information of the loaded models. With almost the same request function, we can determine the model meta information. The only difference is that we don’t access the model_spec attribute of the response object, but the metadata attribute. The information needs to be serialized to be human-readable; therefore, we use SerializeToString to convert the ProtoBuffer information.

from tensorflow_serving.apis import get_model_metadata_pb2

def get_model_meta(model_name, stub):
    request = get_model_metadata_pb2.GetModelMetadataRequest()
    request.model_spec.name = model_name
    request.metadata_field.append("signature_def")
    response = stub.GetModelMetadata(request, 5)
    meta =  response.metadata['signature_def']
    return meta.SerializeToString().decode("utf-8", 'ignore')

>>> model_name = 'complaints_classification'
>>> stub = create_grpc_stub('localhost')
>>> meta = get_model_meta(model_name, stub)

>>> print(meta)
type.googleapis.com/tensorflow.serving.SignatureDefMap
serving_default
complaints_classification_input
        input_1:0
               2@
complaints_classification_output(
dense_1/Softmax:0
               tensorflow/serving/predict

Batching Inference Requests

Batching inference requests is one of the most powerful features of TensorFlow Serving. During model training, batching accelerates our training because we can parallelize the computation of our training samples. At the same time, we can also use the computation hardware efficiently if we match the memory requirements of our batches with the available memory of the GPU.

Figure 4-2. Overview of TensorFlow Serving without Batching

As shown in Figure 4-2, if you run TensorFlow Serving without batching enabled, every client request with one or more data samples triggers an inference of the model, regardless of whether the memory is optimally used or not.

Figure 4-3. Overview of TensorFlow Serving with Batching

As shown in Figure 4-3, multiple clients can request model predictions and the model server batches the different client requests into one “block” to compute. Each request inferred through this batching step might take a bit longer than a single request. However, imagine a large number of individual requests hitting the server. In this case, you can process a significantly larger number of requests if they are processed as a batch.

Configure Batch Inferences

Batching inferences needs to be enabled for TensorFlow Serving and then configured for your use-case. You have five configuration options:

max_batch_size

This parameter controls the maximum batch size. Large batch sizes will increase the request latency and can exhaust the GPU memory. Small batch sizes lose the benefit of optimal computation resource usage.

batch_timeout_micros

This parameter sets the maximum time to wait to fill a batch. This parameter is handy to cap the latency for inference requests.

num_batch_threads

The number of threads configures how many CPU or GPU cores can be used in parallel.

max_enqueued_batches

This parameter sets the maximum number of batches queued for inferences. This configuration is beneficial to avoid an unreasonable backlog of requests. If the maximum number is reached, requests will be returned with an error instead of being queued.

pad_variable_length_inputs

This boolean parameter determines if input tensors with variable lengths will be padded to the same lengths for all input tensors.

As you can imagine, setting the parameters for optimal batching requires some tuning and is application dependent. If you run online inference, you should aim to limit the latency. It is often recommended to set batch_timeout_micros initially to zero and tune the timeout towards 10,000 microseconds. In contrast, batch requests will benefit from longer timeouts (milliseconds to a second) to consistently fill the batch for optimal performance. TensorFlow Serving will make predictions on the batch when either the max_batch_size or the timeout is reached.

If you configure TensorFlow Serving for CPU inference, set num_batch_threads to the number of CPU cores. If you configure a GPU setup, tune max_batch_size to get an optimal utilization of the GPU memory. While you tune your configuration, make sure that you set max_enqueued_batches to a large number to avoid requests being returned with an error instead of being properly inferred.

You can set the parameters in a text file as shown in the following example. In our example, we call the configuration file batching_parameters.txt.

max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
pad_variable_length_inputs: true

If you want to enable batching, you need to pass two additional parameters to the Docker container running TensorFlow Serving. Set enable_batching to true to enable batching and set batching_parameters_file to the absolute path of the batching configuration file inside of the container. Please keep in mind that you have to mount the additional folder with the configuration file if it isn’t located in the same folder as the model versions.

Here is a complete example of the docker run command to start the TensorFlow Serving Docker container with batching enabled. The parameters will then be passed to the TensorFlow Serving instance.

docker run -p 8500:8500 \
           -p 8501:8501 \
           --mount type=bind,source=/path/to/models,target=/models/my_model \
           --mount type=bind,source=/path/to/batch_config,target=/server_config \
           -e MODEL_NAME=my_model -t tensorflow/serving \
           --enable_batching=true \
           --batching_parameters_file=/server_config/batching_parameters.txt

As explained earlier, the configuration of the batching will require additional tuning, but the performance gains should make up for the initial setup. We highly recommend enabling this TensorFlow Serving feature. It is especially useful for offline/batch processes to infer a large number of data samples.

Other TensorFlow Serving Optimizations

TensorFlow Serving comes with a variety of additional optimization features. Additional feature flags are:

--file_system_poll_wait_seconds=1

This parameter determines how often (in seconds) TensorFlow Serving polls whether a new model version is available. You can disable the polling by setting it to -1, or, if you only want to load the model once and never update it, you can set it to 0. The parameter expects an integer value.

--tensorflow_session_parallelism=0

TensorFlow Serving will automatically determine how many threads to use for a TensorFlow session. If you want to set the number of threads manually, you can override it by setting this parameter to any positive integer value.

--tensorflow_intra_op_parallelism=0

This parameter sets the number of cores being used for running TensorFlow Serving. The number of available threads determines how many operations will be parallelized. If the value is zero, all available cores will be used.

--tensorflow_inter_op_parallelism=0

This parameter sets the number of available threads in a pool to execute TensorFlow ops. This is useful to maximize the execution of independent operations in a TensorFlow graph. If the value is set to zero, all available cores will be used and one thread per core allocated.

Similar to our earlier examples, you can pass the configuration parameter to the docker run command as shown in the following example:

docker run -p 8500:8500 \
           -p 8501:8501 \
           --mount type=bind,source=/path/to/models,target=/models/my_model \
           -e MODEL_NAME=my_model -t tensorflow/serving \
           --tensorflow_intra_op_parallelism=4 \
           --tensorflow_inter_op_parallelism=4 \
           --file_system_poll_wait_seconds=10 \
           --tensorflow_session_parallelism=2
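If you run TensorFlow Serving natively instead of via Docker, the same flags can be passed directly to tensorflow_model_server, following the native startup command shown earlier (a sketch with the same example values):

$ tensorflow_model_server --port=8500 \
                          --rest_api_port=8501 \
                          --model_name=my_model \
                          --model_base_path=/models/my_model \
                          --tensorflow_intra_op_parallelism=4 \
                          --tensorflow_inter_op_parallelism=4 \
                          --file_system_poll_wait_seconds=10 \
                          --tensorflow_session_parallelism=2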

Using TensorRT with TensorFlow Serving

If you are running computationally intensive deep learning models on an NVIDIA GPU, you have an additional way of optimizing your model server. NVIDIA provides a library called TensorRT which optimizes the inference of deep learning models by reducing the precision of the numerical representations of the network weights and biases. TensorRT supports int8 and float16 representations. The reduced precision will lower the inference latency of the model.

After your model is trained, you need to optimize the model with TensorRT’s own optimizer 5 or with saved_model_cli. The optimized model can then be loaded into TensorFlow Serving. At the time of writing this chapter, TensorRT was limited to certain NVIDIA products, including the Tesla V100 and P4.

First, we’ll convert our deep learning model with saved_model_cli

$ saved_model_cli convert --dir saved_models/ \
                          --output_dir trt-savedmodel/ \
                          --tag_set serve tensorrt

and then load the model in our GPU setup of TensorFlow Serving

$ docker run --runtime=nvidia \
             -p 8500:8500 \
             -p 8501:8501 \
             --mount type=bind,source=/path/to/models,target=/models/my_model \
             -e MODEL_NAME=my_model \
             -t tensorflow/serving:latest-gpu

If you are inferring on NVidia GPUs and your hardware is supported by TensorRT, switching to TensorRT can be an excellent way to lower your inference latencies further.

TensorFlow Serving Alternatives

TensorFlow Serving is a great way of deploying machine learning models. With the TensorFlow Estimators and Keras models, a large variety of machine learning concepts are covered. If you would like to deploy a legacy model or your machine learning framework of choice isn’t TensorFlow/Keras, here are a couple of options for you.

Seldon

The UK start-up Seldon provides a variety of open source tools to manage model life cycles, and one of the core products is Seldon Core 6. Seldon Core provides you a toolbox to wrap your models in a Docker image which is then deployed via Seldon in a Kubernetes Cluster.

At the time of writing this chapter, Seldon supported machine learning models written in Python, Java, NodeJS, and R.

Seldon comes with its own ecosystem which allows building the preprocessing into its own Docker images, which are deployed in conjunction with the deployment images. It also provides its own routing service which allows you to perform A/B tests or multi-armed bandit experiments.

Seldon is highly integrated with the KubeFlow environment and, similar to TensorFlow Serving, is a way to deploy models with KubeFlow on Kubernetes.

GraphPipe

GraphPipe 7 is another way of deploying TensorFlow and non-TensorFlow models. Oracle drives the open source project. It allows you to deploy not just TensorFlow (inc. Keras) models, but also Caffe2 models and all machine learning models which can be converted to the ONNX format 8. Through the ONNX format you can deploy PyTorch models with GraphPipe.

Besides providing a model server for TensorFlow, PyTorch, etc., GraphPipe also provides client implementations for programming languages like Python, Java, and Go.

Simple TensorFlow Serving

Simple TensorFlow Serving 9 supports more than just TensorFlow models. The current list of supported model frameworks includes ONNX, Scikit-learn, XGBoost, PMML, and H2O. It supports multiple models, inferences on GPUs and client code for a variety of languages.

One significant aspect of Simple TensorFlow Serving is that it supports authentication and encrypted connections to the model server. Authentication is currently not a feature of TensorFlow Serving, and SSL/TLS support requires a custom build of TensorFlow Serving.

MLflow

MLflow 10 supports the deployment of machine learning models, but it is only one aspect of the tool created by Databricks. MLflow is designed to manage model experiments through MLflow Tracking. The tool has a built-in model server which provides REST API endpoints for the models managed through MLflow.

MLflow also provides interfaces to deploy models directly from MLflow to Microsoft’s Azure ML platform and to Amazon SageMaker.
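As a rough sketch of the workflow, you can serve a logged model and request predictions from the built-in REST endpoint as shown below. The run ID is a placeholder, and the exact JSON payload format depends on the model flavor and your MLflow version.

$ mlflow models serve -m runs:/<your_run_id>/model -p 1234

$ curl -X POST http://127.0.0.1:1234/invocations \
       -H "Content-Type: application/json" \
       -d '{"columns": ["sentence"], "data": [["This is an example sentence"]]}'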

Deploying with Cloud Providers

All model server solutions we have discussed up to this point have to be installed and managed by you. However, all major cloud providers, namely Google Cloud, Amazon Web Services, and Microsoft Azure, offer machine learning products that include the hosting of machine learning models.

In this section, we would like to walk you through one deployment option with a cloud provider.

Use Cases

Managed cloud deployments of machine learning models are a good alternative to running your own model server instances if you want to deploy a model seamlessly and don’t want to worry about scaling the model deployment. All cloud providers offer deployment options that scale with the number of inference requests.

However, the flexibility of your model deployment comes at a cost. Managed services provide effortless deployments, but they come at a premium. For example, two model versions running full-time (which require two compute nodes) are more expensive than a comparable compute instance running a TensorFlow Serving instance. Another downside of managed deployments is the limitations of the products: some cloud providers require that you deploy via their own software development kits; others impose limits on the node size and on how much memory your model can take up. These limitations can be a severe restriction for sizeable deep learning models, especially if the models contain a large number of layers or layers that hold language model information.

Example Deployment with Google Cloud Platform

In this section, we will guide you through an example deployment with Google Cloud’s AI Platform. Let’s start with the model deployment, and later we’ll explain how your application client can request predictions from the deployed model.

Model Deployment

The deployment consists of three steps:

  • Making the model accessible on Google Cloud

  • Creating a new model instance with Google Cloud’s AI Platform

  • Creating a new version within the model instance

The deployment starts with uploading your exported TensorFlow/Keras model to a cloud storage bucket. As shown in [Link to Come], you need to upload the entire exported model. Once the upload of the model is complete, copy the complete path of the storage location.

Figure 4-4. Uploading the Trained Model to a Cloud Storage Bucket
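If you prefer the command line over the Cloud Console, the upload can also be performed with gsutil. The bucket name and paths below are placeholders for your own setup.

$ gsutil -m cp -r saved_models/ gs://your-model-bucket/my_model/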

Once you have uploaded your machine learning model, head over to the AI Platform section of Google Cloud Platform to set up your machine learning model for deployment. If this is the first time you are using AI Platform in your GCP project, you’ll need to enable the API. The automated startup process by Google Cloud can take a few minutes.

When you create a new model, you need to give the model a unique identifier. Once you have set the identifier and added an optional project description, continue with the setup by clicking Create.

Figure 4-5. Creating a New Model Instance

Once the new model is registered, you can create a new model version within it. To do that, click on [Link to Come] in the overflow menu.

Figure 4-6. Creating a New Model Version

When you create a new model version, you configure a compute instance which runs your model. Google Cloud gives you a variety of configuration options. The version name is important since you’ll reference it later in the client setup. Set the Model URI to the storage path you saved in the earlier step.

Google Cloud AI Platform supports a variety of machine learning frameworks, including XGBoost and Scikit-learn.

Figure 4-7. Setting up the Instance Details
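The same model and version setup can also be scripted with the gcloud command-line tool instead of the Cloud Console. The following is a sketch only; the bucket path, region, and runtime version are placeholders that depend on your project, while the model and version names match the ones we use in the client example later in this chapter.

$ gcloud ai-platform models create demo_model --regions us-central1

$ gcloud ai-platform versions create v1 \
    --model demo_model \
    --origin gs://your-model-bucket/my_model/ \
    --runtime-version 1.15 \
    --framework tensorflow \
    --python-version 3.7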

Google Cloud Platform also lets you configure how your model instance should scale in case your model experiences a large number of inference requests. As [Link to Come] shows, you can set the scaling behavior. You have two options:

  • Manual scaling

  • Auto-scaling

While manual scaling lets you set the exact number of nodes available for the inferences of your model version, autoscaling adjusts the number of available nodes up and down. If your nodes don’t experience any requests, the number of nodes can even drop to zero. Please note that if autoscaling drops the number of nodes to zero, it will take some time to re-instantiate your model version when the next request hits the model version endpoint. Also, if you run inference nodes in autoscaling mode, you’ll be billed in minute intervals with a minimum of 10 minutes. That means a single request will cost you at least 10 minutes of compute time.

Figure 4-8. Setting up the Scaling Details of the Model Instance

Once the entire model version is configured, Google Cloud spins up the instances for you. When everything is ready for model inference, you’ll see a green check icon next to the version name, as shown in [Link to Come].

Figure 4-9. Completing the Deployment with a New Version Available

You can run multiple model versions simultaneously. In the control panel of the model version, you can set one version as the default, and any inference request without a version specified will be routed to the designated “default version”. Just note that each model version will be hosted on an individual node and accumulate Google Cloud Platform costs.
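If you manage your deployment from the command line, the default version can also be switched with gcloud; the model and version names below match our earlier example.

$ gcloud ai-platform versions set-default v1 --model demo_model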

Model Inference

Since TensorFlow Serving is battle-tested at Google and used heavily internally, it is also used behind the scenes at Google Cloud Platform. You’ll notice that AI Platform not only uses the same model export format as our TensorFlow Serving instances, but that the payloads also have the same data structure as we have seen before.

The only significant difference is the API connection. As you’ll see in this section, you’ll connect to the model version via the GCP API, which handles the request authentication.

To connect to the Google Cloud API, you’ll need to install the library google-api-python-client with

$ pip install google-api-python-client==1.7.8

All Google Cloud services can be connected to via a service object. The helper function in the following code snippet highlights how to create the service object. The Google API client takes a service name and a service version and returns an object which provides all API functionalities via its methods.

import googleapiclient.discovery

def _connect_service():
    # Build a service object for the AI Platform ("ml") API, version v1
    kwargs = {'serviceName': 'ml', 'version': 'v1'}
    return googleapiclient.discovery.build(**kwargs)

Similar to our earlier REST and gRPC examples, we nest our inference data under a fixed instances key which carries a list of input dictionaries. We have created a little helper function to generate the payloads. This function can contain any preprocessing if you need to modify your input data before the inference.

def _generate_payload(sentence):
    # Nest the inference data under the fixed "instances" key
    return {"instances": [{"sentence": sentence}]}

With the service object created on the client side and the payload generated, it’s time to request the prediction from the Google Cloud hosted machine learning model.

The service object of the AI Platform service contains a predict method which accepts a name and a body. The name is a path string containing your Google Cloud Platform project name, your model name and if you want to make predictions with a specific model version, your version name. If you don’t specify a version number, the default model version will be used for the model inference. The body contains the inference data structure we generated earlier.

project = 'yourGCPProjectName'
model_name = 'demo_model'
version_name = 'v1'

service = _connect_service()  # create the service object as shown above
request = service.projects().predict(
    name=f'projects/{project}/models/{model_name}/versions/{version_name}',
    body=_generate_payload(sentence)
)
response = request.execute()
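If you want to rely on the default model version instead, simply omit the version suffix from the name path:

request = service.projects().predict(
    name=f'projects/{project}/models/{model_name}',
    body=_generate_payload(sentence)
)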

The Google Cloud AI Platform response contains the prediction scores for the different categories, similar to a REST response from a TensorFlow Serving instance.

{'predictions': [
    {'label': [0.9000182151794434,
               0.02840868942439556,
               0.009750653058290482,
               0.06182243302464485]}
]}
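Since the response is a plain Python dictionary, you can extract the category scores directly; the label key below follows the response structure shown above.

scores = response['predictions'][0]['label']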

Summary

In this chapter, we discussed how to set up TensorFlow Serving to deploy machine learning models and why a model server is a more scalable option than deploying machine learning models through a Flask web application. We stepped through the installation and configuration steps, introduced the two main communication options, REST and gRPC, and briefly discussed the advantages and disadvantages of both communication protocols.

Furthermore, we explained some of the great benefits of TensorFlow Serving, including the batching of model requests and how to obtain meta information about the different model versions. We also discussed how to set up a quick A/B test with TensorFlow Serving.

We closed this chapter with a brief introduction to a managed cloud service, using Google Cloud AI Platform as an example. This gives you the ability to deploy machine learning models without managing your own server instances.
