O'Reilly Live Online Training

Data Cleaning Essentials for Building Predictive Models with Python (Data Quality Series)

Dealing with missing values, malformed data, and outliers

Topic: Data
Janani Ravi

We live in a time of ever more extreme phenomena. Whether in the financial markets or in the frequency of floods, powerful storms, and other extreme climate events, it’s becoming increasingly difficult to determine whether something is an outlier or a harbinger of a new normal. And as more and more organizations rush to build similar models using similar technologies on similar datasets, the correct handling of outliers, novelties, missing values, and malformed data can make all the difference.

In this course—the first in a three-part series on data handling and feature engineering—expert Janani Ravi shows you how to use Python libraries to deal with malformed and missing data in the real world. You’ll also learn how to perform exploratory data analysis to understand the relationships that exist in your data. Join in to explore the techniques that will help you address possible outliers and novelties in your data modeling use case.

The Data Quality Series is a set of three live online training courses, meant to be followed in this order (although each is a standalone course):

  1. Data Cleaning Essentials for Building Predictive Models with Python (Data Quality Series)
  2. Data Prep Essentials for Building Predictive Models with Python (Data Quality Series)
  3. Data Processing Essentials for Building Predictive Models with Python (Data Quality Series)

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • How to work with real-world data and deal with missing values and outliers
  • How to visualize univariate, bivariate, and multivariate data
  • How to use visualization libraries such as seaborn and Plotly to identify relationships that exist in your data

And you’ll be able to:

  • Use Python libraries such as pandas and scikit-learn to perform data cleaning
  • Deal with missing data using deletion and imputation techniques
  • Identify outliers using visualization and z-scores
  • Eliminate outliers from your data
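
As a rough illustration of the deletion and imputation techniques listed above, here is a minimal pandas sketch. The toy DataFrame and its column names (`age`, `income`) are invented for this example and are not taken from the course materials; mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Deletion: drop every row that contains at least one missing value.
dropped = df.dropna()

# Imputation: fill each missing value with the mean of its column.
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)
```

Deletion keeps only fully observed rows, so it can discard a lot of data when gaps are scattered across columns. Imputation keeps every row but replaces the gaps with estimates. scikit-learn's `SimpleImputer` with `strategy="mean"` performs the same kind of mean fill inside a modeling pipeline.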

This training course is for you because...

  • You’re a business analyst who needs to make sense of large quantities of data of uncertain provenance and quality.
  • You’re a data scientist who wants to understand how to use the right data.
  • You’re a data engineer who’s noticed that a model that worked fine in testing isn’t working quite as well in practice.

Prerequisites

  • A working knowledge of Python and the Jupyter Notebook
  • A basic understanding of building and training ML models
  • Familiarity with regression and classification techniques in ML

About your instructor

  • Janani Ravi is a cofounder of Loonycorn, a team dedicated to upskilling IT professionals. She’s been involved in more than 75 online courses on Azure and GCP. Previously, Janani worked at Google, Flipkart, and Microsoft. She completed her studies at Stanford.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Data collection (10 minutes)

  • Presentation: Data collection and analysis; problems encountered when working with real-world data
  • Jupyter Notebook exercise: Work with Jupyter notebooks
  • Q&A

Visualizing data (45 minutes)

  • Presentation: Picking the right visualization for univariate, bivariate, and multivariate data
  • Jupyter Notebook exercises: Use heat maps to understand correlations in data; use autocorrelation plots for time series data; use candlestick plots for financial data; perform exploratory data analysis on a real-world dataset
  • Group discussion: How to pick the right visualization based on your use case
  • Q&A
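
The numbers behind a correlation heat map and an autocorrelation plot can be sketched with pandas alone. This is a minimal example under stated assumptions: the data and column names (`x`, `y`, `z`) are invented, and the plotting step is left as a comment because the course's own notebooks are not shown here.

```python
import pandas as pd

# Hypothetical toy data: y rises with x, z falls as x rises.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlations: the matrix a heat map would color.
corr = df.corr()

# With seaborn installed, the matrix could be drawn with:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)

# Lag-1 autocorrelation: correlation of a series with itself shifted by one step.
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
lag1 = series.autocorr(lag=1)

print(corr)
print(lag1)
```

An autocorrelation plot repeats the lag calculation for many lags and draws the results, which makes trends and seasonality in time series data easier to spot.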

Break (5 minutes)

Cleaning data (55 minutes)

  • Presentation: Techniques to deal with missing data
  • Jupyter Notebook exercise: Delete and impute missing data; detect outliers using visualizations; detect outliers using z-scores; clean a real-world dataset that contains missing values and outliers using techniques learned in this session
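
Detecting and eliminating outliers with z-scores might look like the following sketch. The sample values and the cutoff of 3 standard deviations are assumptions chosen for illustration; the course may use a different threshold or library helper.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95], dtype=float)

# z-score: distance from the mean in units of (population) standard deviation.
z = (values - values.mean()) / values.std(ddof=0)

# Assumed cutoff: flag points more than 3 standard deviations from the mean.
threshold = 3.0
is_outlier = z.abs() > threshold

# Eliminate the flagged points.
cleaned = values[~is_outlier]

print(values[is_outlier])
print(cleaned)
```

Because an extreme value inflates both the mean and the standard deviation, z-scores can miss outliers in very small samples, which is one reason to pair this check with a visual inspection such as a box plot.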

Wrap-up and Q&A (5 minutes)