Chapter 15. Optimizing Latency
Embedded systems have limited computing power, which means that the intensive calculations neural networks require can take longer than on most other platforms. Because embedded systems usually operate on streams of sensor data in real time, running too slowly can cause a variety of problems. Suppose you’re trying to observe something that occurs only briefly, like a bird passing through a camera’s field of view. If your processing time is too long, you might sample the sensor too infrequently and miss the event entirely. Sometimes the quality of a prediction is improved by repeated observations of overlapping windows of sensor data: the wake-word detection example runs inference on a one-second window of audio, but moves the window forward only a hundred milliseconds or less each time and averages the results. In these cases, reducing latency lets us improve overall accuracy. Speeding up model execution might also allow the device to run at a lower CPU frequency, or go to sleep between inferences, which can reduce overall energy usage.
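To make the overlapping-window idea concrete, here is a minimal sketch in Python. It isn’t the wake-word example’s actual code, and the class name, score values, and parameters are hypothetical; it simply shows how averaging scores from overlapping inference windows suppresses one-off spikes while letting a sustained signal cross a detection threshold.

```python
from collections import deque

class SmoothedDetector:
    """Averages raw scores from overlapping inference windows.

    Each call to update() represents one inference on a window of
    sensor data that overlaps heavily with the previous one (for
    example, a one-second audio window advanced by 100 ms per step).
    """

    def __init__(self, history=5, threshold=0.8):
        self.scores = deque(maxlen=history)  # most recent raw scores
        self.threshold = threshold

    def update(self, score):
        """Record a new raw score; return True only when the
        smoothed average crosses the detection threshold."""
        self.scores.append(score)
        average = sum(self.scores) / len(self.scores)
        return average >= self.threshold

detector = SmoothedDetector(history=5, threshold=0.8)

# A single noisy spike is smoothed away: no detection fires.
for score in [0.1, 0.1, 0.9, 0.1, 0.1]:
    assert detector.update(score) is False

# A sustained high score eventually pushes the average over the
# threshold and triggers a detection.
results = [detector.update(0.9) for _ in range(5)]
assert results[-1] is True
```

The connection to latency: because each window advances by only a fraction of its length, a faster model fits more overlapping windows into each second of sensor data, producing more scores to average and therefore a more reliable smoothed result.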
Because latency is such an important area for optimization, this chapter focuses on some of the different techniques you can use to reduce the time it takes to run your model.
First Make Sure It Matters
It’s possible that your neural network code is such a small part of your overall system latency that speeding it up wouldn’t make a big difference to your product’s ...