Chapter 2: Exploring Data and Building Data Pipelines

Data collection and cleaning are major steps in machine learning because your model is only as good as your data. Most of the time in a machine learning project is spent on cleaning data and engineering features. In this chapter, we will focus on data cleaning and exploratory data analysis (EDA). We will talk about data visualization and statistical techniques for detecting bad data (missing values, outliers, and duplicate values). Then we will cover how to normalize the data, how to handle bad data (for example, by enforcing a data schema), how to handle missing data, and how to check for data leakage. We will also cover how you can use TensorFlow Data Validation (TFDV) to validate data for large-scale systems.

Visualization

Data visualization is an exploratory technique for finding trends and outliers in the data. It helps in the data cleaning process because plotting the data on a chart shows whether the data is imbalanced. It also helps in the feature engineering process because visualizing a feature shows how it may influence your model, which helps you decide which features to select and which to discard.
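As a rough illustration of spotting class imbalance visually, the sketch below counts how often each label occurs and prints a text bar chart. It uses only the Python standard library; in practice you would more likely draw the same chart with a plotting library. The function names (`class_distribution`, `text_bar_chart`) and the label values are hypothetical and chosen only for this example.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of rows} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def text_bar_chart(distribution, width=40):
    """Render one text bar per class, most frequent class first."""
    lines = []
    for label, fraction in sorted(distribution.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(fraction * width)
        lines.append(f"{label:>10} | {bar} {fraction:.0%}")
    return "\n".join(lines)


# Hypothetical labels: 90 legitimate transactions and 10 fraudulent ones.
labels = ["legit"] * 90 + ["fraud"] * 10
distribution = class_distribution(labels)
print(text_bar_chart(distribution))
```

One long bar next to one very short bar makes the imbalance obvious at a glance, which is the kind of signal a chart gives you before you start training a model.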

There are two ways to visualize data:
