Chapter 3. Data Preprocessing
In this chapter, you’ll learn how to prepare and set up data for training. Three of the most common data formats for ML work are tables, images, and text. Each has its own commonly practiced preparation techniques, though how you set up your data engineering pipeline will of course depend on your problem statement and on what you are trying to predict.
I’ll look at all three formats in detail, using specific examples to walk you through the techniques. All of the data can be read directly into your Python runtime memory; however, doing so is not the most efficient use of your compute resources. When I discuss text data, I’ll give particular attention to tokenization and dictionaries. By the end of this chapter, you will know how to prepare table, image, and text data for training.
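As a preview of what tokenization and dictionaries mean for text data, here is a minimal sketch in plain Python. It is not the pipeline used later in the chapter: the function names (`tokenize`, `build_dictionary`, `encode`), the whitespace-splitting rule, and the reserved `<pad>` and `<unk>` entries are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple rule)."""
    return text.lower().split()


def build_dictionary(corpus):
    """Map each distinct token in the corpus to an integer ID.

    IDs 0 and 1 are reserved here (by assumption) for padding and for
    tokens that never appeared in the corpus. More frequent tokens get
    smaller IDs.
    """
    counts = Counter(token for line in corpus for token in tokenize(line))
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text into a list of integer IDs, using <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]


# Hypothetical two-sentence corpus, used only to exercise the functions.
corpus = ["the cat sat", "the dog sat down"]
vocab = build_dictionary(corpus)
ids = encode("the bird sat", vocab)  # "bird" is not in the dictionary
print(ids)
```

The dictionary turns each token into an integer that a model can consume, and any word missing from the dictionary falls back to the `<unk>` ID instead of raising an error.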
Preparing tabular data for training
In a tabular dataset, it is important to identify which ...