3 Distributed training patterns

This chapter covers

  • Distinguishing the traditional model training process from the distributed training process
  • Using parameter servers to build models that cannot fit in a single machine
  • Improving distributed model training performance using the collective communication pattern
  • Handling unexpected failures during the distributed model training process

The previous chapter introduced a couple of practical patterns for the data ingestion process, which is usually the first stage of a distributed machine learning system. That stage is responsible for monitoring incoming data and performing the preprocessing steps needed to prepare it for model training.

Distributed training, the next step after ...
