O'Reilly live online training

Cleaning data at scale

Boosting performance of industrial data science

Powered by Jupyter

Dr. Philip Winder

The internet is full of examples of how to train models. But the reality is that most of the time spent on industrial projects involves working with data. Thus, the largest improvements in performance can often be found through improving the underlying data.

In this hands-on three-hour course, expert Philip Winder teaches you fundamental techniques to improve and make the best use of your data. You'll learn how to impute missing data, clean corrupted data, remove anomalies, and convert features into a suitable format. You'll also discover how and why you should be transforming features and how to generate new features to boost performance.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • Why improving data quality improves results and performance
  • The many ways in which data can become corrupt
  • Why the type of data affects data cleaning
  • Why derived data can be better than the original

And you’ll be able to:

  • Determine when and how to clean data
  • Spot different types of corruption
  • Transform the data to produce better representations of the original
  • Clean all types of data: categorical, continuous, time series, etc.

This training course is for you because...

  • You're an engineer who has to clean and improve data to remove anomalies (such as for monitoring purposes).
  • You're a data scientist who has to clean and improve data to make solutions more robust, more performant, and simpler.

Prerequisites

  • Familiarity with Python
  • A working knowledge of basic statistics

About your instructor

  • Dr. Philip Winder is a multidisciplinary engineer who creates data-driven software products. His work incorporates data science, cloud native, and traditional software development using a range of languages and tools.

    Phil is the CEO of Winder, a data science consultancy in the UK, which operates throughout Europe delivering training, development, and consultancy services. He has a Ph.D. and a master's degree in Electronics from the University of Hull, UK.

Schedule

The timeframes are only estimates and may vary depending on how the class progresses.

Introduction (10 minutes)

  • Lecture: How bad data affects results; why bad data affects results; examples of bad data; data availability; data consistency; data leakage
  • Group discussion: How has bad data affected your projects?
  • Hands-on exercise: Analyze data visualizations and attempt to spot all the different types of bad data

Types of data (20 minutes)

  • Lecture: Types of data
  • Hands-on exercises: Explore data provided in a notebook

Corrupted data (20 minutes)

  • Lecture: Fixing missing data; improving noisy data; detecting anomalies; fixing inconsistencies
  • Hands-on exercises: Fix missing data in a notebook; improve noisy data in a notebook; detect and remove outliers
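As a rough illustration of two of the steps listed above (fixing missing data and removing outliers), here is a minimal sketch using only the Python standard library. The function names `impute_median` and `remove_outliers_iqr`, the sample readings, and the choice of median imputation and Tukey's IQR fences are assumptions made for this example; the course notebooks may use different techniques or libraries.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep only points inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sensor readings: two gaps (None) and one obvious spike.
readings = [10.1, 9.8, None, 10.3, 55.0, 9.9, None, 10.0]

filled = impute_median(readings)       # gaps replaced by the median
cleaned = remove_outliers_iqr(filled)  # the 55.0 spike is dropped
print(cleaned)
```

The median is used rather than the mean because it is not pulled toward the spike, so the imputed values stay close to the typical reading.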

Break (10 minutes)

Transforming data (40 minutes)

  • Lecture: Statistical distributions; non-normal data; why normality is important; returning data to normality; model- and domain-based transformations; arbitrary transformations
  • Group discussion: Why is normality important? What transformations exist in your domain?
  • Hands-on exercises: Establish whether the data in the notebook is normally distributed; renormalize non-normal data; transform example data in the notebook
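One informal way to approach the normality exercises above is to measure skewness before and after a transformation. The sketch below is an assumption-laden example, not the course's method: the `skewness` helper and the income figures are hypothetical, and skewness near zero is only a rough heuristic for symmetry, not a formal normality test.

```python
import math
import statistics


def skewness(values):
    """Population skewness: roughly 0 for symmetric data, > 0 for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed data (a few very large values).
incomes = [12, 15, 18, 20, 22, 25, 30, 45, 80, 250]

# A log transform compresses the long right tail; it requires positive values.
logged = [math.log(v) for v in incomes]

print("skewness before:", skewness(incomes))
print("skewness after: ", skewness(logged))
```

The log transform is only one option. Whether it is appropriate depends on the data, and a domain-based transformation may be a better fit.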

Break (10 minutes)

Working with scales (10 minutes)

  • Lecture: Why scale is important; altering the scale of numerical values; handling categorical data
  • Hands-on exercise: Alter the scale of example data and work with categorical data in the notebook
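The scaling and categorical-data topics above can be sketched in a few lines of plain Python. The helpers `min_max`, `standardize`, and `one_hot` are hypothetical names chosen for this example, and one-hot encoding is just one common way to handle categories; in practice a library such as scikit-learn or pandas would typically do this work.

```python
import statistics


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # fails if all values are equal


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]  # fails if all values are equal


def one_hot(labels):
    """Return the sorted category list and one 0/1 indicator row per label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


heights_cm = [2, 4, 6]  # toy numeric feature
colours = ["red", "blue", "red"]  # toy categorical feature

scaled = min_max(heights_cm)
z_scores = standardize(heights_cm)
categories, encoded = one_hot(colours)
print(scaled, categories, encoded)
```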

Derived variables (10 minutes)

  • Lecture: Why new data can be better than the original; domain-specific feature generation; brute-force feature generation
  • Group discussion: Can feature extraction ever be automatic?

Feature selection (20 minutes)

  • Lecture: Why select features; how to select features
  • Hands-on exercise: Improve the performance of a model through feature generation and selection in the notebook

Series variables (20 minutes)

  • Lecture: How ordered data differs; applying preprocessing techniques to series data
  • Group discussion: How can time series data be corrupted?
  • Hands-on exercises: Clean an example time series in the notebook
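For ordered data, cleaning usually has to respect neighbouring points. The following sketch assumes a simple two-step approach (forward-filling gaps, then a rolling median to suppress isolated spikes); the helpers `forward_fill` and `rolling_median` and the sample signal are hypothetical and are not taken from the course notebook.

```python
import statistics


def forward_fill(series):
    """Replace None with the most recent observed value (leading gaps stay None)."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def rolling_median(series, window=3):
    """Centered rolling median; the window shrinks at the edges of the series."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(statistics.median(series[lo:hi]))
    return out


# Hypothetical signal with one missing sample and one spike.
signal = [1.0, 1.1, None, 1.2, 9.0, 1.3, 1.4]

filled = forward_fill(signal)      # the gap takes the previous value, 1.1
smoothed = rolling_median(filled)  # the 9.0 spike is replaced by a neighbour's value
print(smoothed)
```

Forward filling uses only past values, which avoids leaking future information into earlier time steps. The centered rolling median does look one step ahead, so a trailing window would be the safer choice in a forecasting setting.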

Wrap-up and Q&A (10 minutes)

  • Lecture: Related techniques; dimensionality reduction (aggregation, MDS, ICA, PCA, t-SNE, etc.); data integration