Chapter 3. Data Preprocessing
In this chapter, you’ll learn how to prepare and set up data for training. Some of the most common data formats for ML work are tables, images, and text. There are commonly practiced techniques associated with each, though how you set up your data engineering pipeline will, of course, depend on what your problem statement is and what you are trying to predict.
I’ll look at all three formats in detail, using specific examples to walk you through the techniques. All of this data can be read directly into your Python runtime’s memory; however, that isn’t the most efficient use of your compute resources. When I discuss text data, I’ll give particular attention to tokenization and dictionaries. By the end of this chapter, you’ll have learned how to prepare tabular, image, and text data for training.
Preparing Tabular Data for Training
In a tabular dataset, it is important to identify which columns are categorical, because their values must be encoded as classes, or as binary representations of those classes (one-hot encoding), rather than treated as numerical values. Another aspect of tabular datasets is the potential for interactions among multiple features. This section will also look at the API that TensorFlow provides to make it easier to model these column interactions.
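To make these two ideas concrete before we turn to the TensorFlow API, here is a minimal sketch using pandas on a hypothetical toy DataFrame (the `color`, `size`, and `price` columns are invented for illustration). It shows one-hot encoding of categorical columns with `pd.get_dummies`, and a simple "feature cross" that models an interaction between two categorical columns by concatenating their values into a single new category; TensorFlow offers its own versions of these operations.

```python
import pandas as pd

# Hypothetical toy dataset: "color" and "size" are categorical features,
# "price" is numeric.
df = pd.DataFrame({
    "color": ["red", "green", "red"],
    "size":  ["S", "L", "S"],
    "price": [9.5, 12.0, 9.0],
})

# One-hot encode the categorical columns: each distinct category value
# becomes its own 0/1 indicator column; numeric columns pass through.
one_hot = pd.get_dummies(df, columns=["color", "size"])
print(sorted(one_hot.columns))
# → ['color_green', 'color_red', 'price', 'size_L', 'size_S']

# A simple feature cross: combine two categorical columns into one new
# categorical column so the model can learn about their joint values
# (e.g., "red_S" behaves differently than "red" and "S" separately).
df["color_x_size"] = df["color"] + "_" + df["size"]
crossed = pd.get_dummies(df["color_x_size"])
print(sorted(crossed.columns))
# → ['green_L', 'red_S']
```

This is only a conceptual sketch: pandas does the encoding eagerly in memory, whereas TensorFlow's feature handling (for example, `tf.feature_column.indicator_column` for one-hot encoding and `tf.feature_column.crossed_column` for interactions) builds the same transformations into the model's input pipeline.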
It’s common to encounter tabular datasets as CSV files or simply as structured output from a database query. For this example, we’ll start with a dataset that’s already in a pandas DataFrame and ...