Chapter 5. Data Pipelines for Streaming Ingestion

Data ingestion is an important part of your workflow. Before raw data reaches the model, it must pass through a series of transformation steps that put it into the input format the model expects. This series of steps is known as the data pipeline. The data pipeline matters because the same steps must also be applied to production data, that is, the data the model consumes once it is deployed. Whether you are building and debugging a model or preparing it for deployment, you need to format the raw data for the model's consumption.

Using the same series of steps during model building and during deployment planning ensures that the test data is processed exactly the same way as the training data.

In Chapter 3 you learned how Python generators work, and in Chapter 4 you learned how to use the flow_from_directory method for transfer learning. In this chapter, you will see more of the tools that TensorFlow provides for handling other data types, such as text and numeric arrays. You'll also learn how to handle another type of file structure for images. File organization becomes especially important when handling text or images for model training because it is common to use directory names as labels. This chapter recommends a directory-organization practice for building and training text or image classification models.
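
To make that recommendation concrete before diving in, here is a minimal sketch of the kind of layout the chapter builds on. The directory and file names here (train, pos, neg, and the review files) are hypothetical, chosen only to illustrate how class subdirectory names double as labels:

    import pathlib

    # Hypothetical layout for a binary text classifier.  Each class
    # gets its own subdirectory, and the subdirectory names serve as
    # the labels:
    #
    #   train/
    #       pos/
    #           review_001.txt
    #           review_002.txt
    #       neg/
    #           review_101.txt
    #           review_102.txt

    base_dir = pathlib.Path("train")

    # List each class directory and count its files, confirming that
    # the on-disk structure matches the label scheme a loader will
    # infer from it.
    for class_dir in sorted(p for p in base_dir.iterdir() if p.is_dir()):
        file_count = sum(1 for _ in class_dir.glob("*.txt"))
        print(f"label {class_dir.name!r}: {file_count} files")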

Streaming Text Files with the text_dataset_from_directory Function

You can stream pretty ...
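
With a layout like the one sketched earlier in place, a minimal example of streaming it might look like the following. The train/ directory, batch size, split fraction, and seed are illustrative; in recent TensorFlow 2 releases the function lives under tf.keras.utils, while earlier 2.x releases exposed it as tf.keras.preprocessing.text_dataset_from_directory:

    import tensorflow as tf

    # Stream raw text files from the hypothetical train/ directory.
    # Labels are inferred from the class subdirectory names, and an
    # 80/20 training/validation split is carved out with a fixed seed
    # so both subsets partition the files consistently.
    train_ds = tf.keras.utils.text_dataset_from_directory(
        "train",
        batch_size=32,
        validation_split=0.2,
        subset="training",
        seed=42,
    )

    # Each element is a batch of raw text strings paired with
    # integer labels.
    for texts, labels in train_ds.take(1):
        print(texts.shape, labels.shape)  # (32,) (32,)

Because the result is an ordinary tf.data.Dataset, the usual pipeline methods such as map, cache, and prefetch can be chained onto it before training.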
