Chapter 10. Improving Inference Efficiency

When a deep learning (DL) model is deployed on an edge device, inference efficiency is often unsatisfactory. The inefficiency stems mainly from the size of the trained network: larger networks require more computation per prediction. As a result, engineers and scientists often trade some accuracy for speed when deploying a DL model on an edge device, and they focus on reducing model size because edge devices typically have limited storage.

In this chapter, we will introduce techniques for reducing inference latency while preserving the original model's performance as much as possible. First, we will cover network quantization, a technique that shrinks the network by representing model parameters in lower-precision data formats. ...
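To make the idea concrete before we dive in, here is a minimal sketch of quantization using PyTorch's dynamic quantization API; the choice of PyTorch and of this particular workflow is an assumption for illustration, and later sections cover quantization in detail. The sketch converts the Linear layers of a toy model from 32-bit floating-point weights to 8-bit integers:

```python
import torch
import torch.nn as nn

# A small stand-in model; any network containing nn.Linear layers works.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Replace the float32 weights of every Linear layer with int8 weights.
# Activations are quantized dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for CPU inference.
x = torch.randn(1, 128)
print(quantized_model(x).shape)  # torch.Size([1, 10])
```

Because each quantized weight occupies one byte instead of four, this alone can cut the stored parameter size roughly fourfold, which is exactly the kind of saving edge deployments need.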
