Chapter 2. Data Storage and Ingestion

To figure out how to set up an ML model to solve a problem, you have to start thinking about patterns in your data's structure. In this chapter, we'll look at some general patterns in storage, data formats, and data ingestion. Typically, once you understand your business problem and frame it as a data science problem, you have to think about how to get the data into a format or structure that your model training process can use. Data ingestion during training is fundamentally a data transformation pipeline. Without this transformation, you won't be able to deliver and serve the model in an enterprise-driven or use-case-driven setting; it would remain nothing more than an exploration tool, unable to scale to large amounts of data.

This chapter will show you how to design a data ingestion pipeline for two common data structures: tables and images. You will learn how to make the pipeline scalable by using TensorFlow’s APIs.
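As a preview, here is a minimal sketch of what such a pipeline can look like for each data structure, using the tf.data and Keras utility APIs. The file name train.csv, the label column name, and the images/ directory layout are assumptions made for illustration, not part of any particular dataset:

```python
import tensorflow as tf

# Hypothetical layout: a CSV file with a "label" column, and an
# images/ directory with one subdirectory per class.

# Tabular data: stream batches of rows directly from a CSV file.
csv_ds = tf.data.experimental.make_csv_dataset(
    "train.csv",          # assumed file name
    batch_size=32,
    label_name="label",   # assumed label column
    num_epochs=1,
)

# Image data: stream batches of decoded, resized images.
img_ds = tf.keras.utils.image_dataset_from_directory(
    "images/",            # assumed directory layout
    batch_size=32,
    image_size=(224, 224),
)

# Each dataset yields (features, labels) batches that a model
# can consume directly, without loading everything into memory.
for features, labels in csv_ds.take(1):
    print(labels.shape)   # (32,)
```

Both APIs return a tf.data.Dataset, which is what makes the pipeline scalable: batching, shuffling, and prefetching are handled by the same abstraction regardless of the underlying data structure.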

Data streaming is the means by which data is ingested in small batches for model training. Data streaming in Python is not a new concept, but grasping it is fundamental to understanding how the more advanced APIs in TensorFlow work, so this chapter will start with Python generators (see the sketch below). Then we'll look at how tabular data is stored, including how to indicate and track features and labels. We'll then move to designing your data structure, and finish by discussing how to ingest data into your ...
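To make the generator idea concrete, here is a minimal sketch: a plain Python generator that yields one example at a time, wrapped into a tf.data.Dataset so TensorFlow can batch and stream it. The data here is synthetic, invented purely for illustration:

```python
import numpy as np
import tensorflow as tf

def example_generator(n_examples=100):
    """A plain Python generator: yields one (features, label) pair
    at a time instead of materializing the whole dataset in memory."""
    for _ in range(n_examples):
        features = np.random.rand(4).astype("float32")  # synthetic features
        label = np.random.randint(0, 2)                 # synthetic binary label
        yield features, label

# Wrap the generator so TensorFlow can stream, batch, and prefetch it.
dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_signature=(
        tf.TensorSpec(shape=(4,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

# The model never sees the whole dataset at once, only small batches.
for features, labels in dataset.batch(16).take(1):
    print(features.shape, labels.shape)  # (16, 4) (16,)
```

This is exactly the pattern the higher-level TensorFlow ingestion APIs build on: a source that produces elements one at a time, composed with batching and prefetching operations.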
