Chapter 3: Building a Data Parallel Training and Serving Pipeline

In the previous chapter, we discussed the two mainstream data parallel training paradigms: parameter server and All-Reduce. Because of the shortcomings of the parameter server paradigm, All-Reduce is now the more widely adopted architecture for data parallel training, so we will illustrate our implementation using the All-Reduce paradigm.
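As a quick reminder of what the All-Reduce step does in data parallel training, here is a minimal single-process sketch in plain Python. It is not a real distributed implementation: the function name `all_reduce_mean` and the toy gradient values are hypothetical and chosen only for illustration, and the workers are simulated as entries in a list rather than as separate processes or devices.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an All-Reduce (mean) over per-worker gradient vectors.

    Each inner list stands in for the gradients computed by one worker
    on its own shard of the mini-batch. Hypothetical helper for illustration.
    """
    num_workers = len(worker_grads)
    length = len(worker_grads[0])

    # Reduce: sum the gradients element-wise across all workers.
    summed = [sum(grads[i] for grads in worker_grads) for i in range(length)]

    # Average so the result matches a single large-batch gradient.
    averaged = [value / num_workers for value in summed]

    # "All" part: every worker ends up with its own copy of the same result.
    return [list(averaged) for _ in range(num_workers)]


# Toy gradients from three simulated workers (made-up numbers).
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = all_reduce_mean(grads)
print(synced)
```

After the call, every simulated worker holds the same averaged gradients, which is why each model replica can apply an identical update without a central parameter server.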

In this chapter, we will focus mainly on the coding side of data parallelism. Before we dive into the details, here are the assumptions we make for the implementations in this chapter:

  • We will use homogeneous hardware for all our training nodes.
  • All our training nodes will be exclusively used for a single job, which means no resource sharing ...
