2 Data ingestion patterns

This chapter covers

  • Understanding data ingestion and its responsibilities
  • Handling large datasets in memory by consuming smaller datasets in batches (the batching pattern)
  • Preprocessing extremely large datasets as smaller chunks on multiple machines (the sharding pattern)
  • Fetching and re-accessing the same dataset for multiple training rounds (the caching pattern)
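
The three patterns listed above can be previewed in a few lines of plain Python. This is only a minimal single-process sketch, not the implementation the chapter develops: the names `batches`, `shard`, and `CachedDataset` are hypothetical, and the sharding function assumes each worker already knows its own index and the total number of workers.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Batching: yield fixed-size lists so only one batch is held in memory."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def shard(records: Sequence[T], num_shards: int, shard_index: int) -> List[T]:
    """Sharding: the worker at shard_index takes every num_shards-th record."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return list(records[shard_index::num_shards])


class CachedDataset:
    """Caching: run the (hypothetical) loader once, then reuse the result."""

    def __init__(self, loader: Callable[[], Iterable[T]]) -> None:
        self._loader = loader
        self._cache: Optional[List[T]] = None
        self.load_count = 0  # how many times the loader actually ran

    def __iter__(self) -> Iterator[T]:
        if self._cache is None:
            self._cache = list(self._loader())
            self.load_count += 1
        return iter(self._cache)
```

In this sketch, iterating over a `CachedDataset` for several training rounds triggers the loader only on the first pass, while `batches` and `shard` bound how much data a single step or a single worker has to handle.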

Chapter 1 discussed the growing scale of modern machine learning applications, including larger datasets and heavier traffic for model serving. It also covered the complexity and challenges of building distributed systems, particularly distributed systems for machine learning applications. We learned that a distributed machine learning system is usually ...
