O'Reilly Live Online Training

Value Imputation: Cleaning Data for Effective Data Science

David Mertz, Ph.D.

In this live online training you will learn how to impute values for data, both at the level of individual problematic data points and more globally as patterns in a data set. Sometimes data is imputed based on typical values for a particular feature. Other times, imputation can be based on locality in parameter space or on trends in the overall data; you will become familiar with each approach. Beyond imputing new values for particular data points, you will learn to balance your data set overall by using undersampling and oversampling techniques.
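As a rough illustration of two of these ideas, the sketch below fills gaps first with a typical value (the median) and then along a linear trend. It is not taken from the course materials. The function names `impute_typical` and `impute_linear_trend` and the sample data are hypothetical, and plain Python is used only so the example runs without the scientific stack.

```python
from statistics import median


def impute_typical(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    typical = median(observed)
    return [typical if v is None else v for v in values]


def impute_linear_trend(values):
    """Fill interior gaps by linear interpolation between the nearest
    observed neighbors. Leading and trailing gaps are left as None,
    because there is no neighbor on one side to interpolate from."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical data: unordered ages, and an ordered temperature series.
ages = [34, None, 29, 41, None, 38]
temps = [10.0, None, None, 16.0, 18.0, None]

print(impute_typical(ages))        # [34, 36.0, 29, 41, 36.0, 38]
print(impute_linear_trend(temps))  # [10.0, 12.0, 14.0, 16.0, 18.0, None]
```

The median suits features with no meaningful ordering between rows, while interpolation only makes sense when neighboring observations are related, as in a time series.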

Cleaning Data for Effective Data Science is a series of live online trainings, and soon a book, that address the central skills required of all data scientists, data analysts, and other developers in data-oriented domains.

The series covers everything a data scientist needs to do before the work that is usually called data science, across a broad range of topics:

  • Data Ingestion of Tabular Formats
  • Value Imputation
  • Anomaly Detection
  • Feature Engineering
  • Data Ingestion of Hierarchical and Other Data Formats

What you'll learn and how you can apply it

  • How to identify and impute typical values to missing or unreasonable data
  • How to analyze trends in data for refined imputation
  • How to use oversampling and undersampling to produce more balanced data sets
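To make the last point concrete, here is a minimal sketch of random undersampling and oversampling. It assumes that "balanced" means every class ends up with the same number of rows. The `rebalance` helper, its parameters, and the toy rows are hypothetical, and the course itself may use different techniques or libraries.

```python
import random
from collections import Counter, defaultdict


def rebalance(rows, label_of, target, seed=0):
    """Resample rows so that every class has exactly `target` members.

    Classes with at least `target` rows are undersampled without
    replacement. Smaller classes keep all their rows and are
    oversampled by duplicating randomly chosen members.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            balanced.extend(rng.sample(members, target))
        else:
            extra = rng.choices(members, k=target - len(members))
            balanced.extend(members + extra)
    return balanced


# Hypothetical imbalanced data: 8 "common" rows and 2 "rare" rows.
rows = [(i, "common") for i in range(8)] + [(i, "rare") for i in range(8, 10)]

balanced = rebalance(rows, label_of=lambda row: row[1], target=4)
counts = Counter(label for _, label in balanced)
print(counts)
```

Duplicating rows is the simplest form of oversampling. It adds no new information, so it is best treated as a baseline rather than a recommendation.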

This training course is for you because...

  • You need to do the actual work of a data scientist: the many hours that come before the theory of machine learning, statistics, and data visualization.

Prerequisites

  • Students in this course should have a comfortable, intermediate-level knowledge of Python and/or R, and a very basic familiarity with statistics.

Course Set-up:

Students should have a system with Jupyter Notebooks installed, along with a recent version of scikit-learn, pandas, NumPy, matplotlib, and the general scientific Python tool stack. The training materials will be made available as notebooks in a GitHub repository. Optionally, installing R will allow students to follow some additional examples.
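One way to confirm the set-up before class is to list the installed version of each package. The sketch below uses only the standard library. The distribution names in `REQUIRED` are assumptions: in particular, `notebook` stands in for Jupyter, and a JupyterLab installation would be listed under a different name. The course does not state minimum versions, so none are checked.

```python
from importlib import metadata

# Assumed distribution names for the tools mentioned above.
REQUIRED = ["notebook", "scikit-learn", "pandas", "numpy", "matplotlib"]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```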

About your instructor

  • David Mertz is a data scientist, trainer, and erstwhile startup CTO, who is currently writing the Addison-Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped create the world's fastest (and highly specialized) supercomputer for performing molecular dynamics.

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

  • Segment 1: Typical tabular data (30 minutes)
  • Segment 2: Locality imputation (30 minutes)

Break (15 minutes)

  • Segment 3: Trend imputation; types of trends (15 minutes)
  • Segment 4: A larger coarse time series trend (30 minutes)
  • Segment 5: Non-temporal trends (30 minutes)

Break (15 minutes)

  • Segment 6: Undersampling (30 minutes)
  • Segment 7: Oversampling (30 minutes)
  • Segment 8: Exercises (15 minutes)