6 Operation patterns

This chapter covers

  • Recognizing areas for improvement in machine learning systems, such as job scheduling and metadata
  • Preventing resource starvation and avoiding deadlocks using scheduling techniques, such as fair-share scheduling, priority scheduling, and gang scheduling
  • Using the metadata pattern to handle failures more effectively and reduce their negative effect on users

In chapter 5, we focused on machine learning workflows and the challenges of building them in practice. A workflow is an essential part of a machine learning system because it connects all the other components. A machine learning workflow can be as simple as chaining data ingestion, model training, and model serving. It can also be very complex when handling ...
