Cleaning data at scale
Boosting performance of industrial data science
The internet is full of examples of how to train models. But the reality is that most of the time spent on industrial projects involves working with data. Thus, the largest improvements in performance can often be found through improving the underlying data.
In this hands-on three-hour course, expert Philip Winder teaches you fundamental techniques to improve and make the best use of your data. You'll learn how to impute missing data, clean corrupted data, remove anomalies, and convert features into a suitable format. You'll also discover how and why you should be transforming features and how to generate new features to boost performance.
What you'll learn and how you can apply it
By the end of this live online course, you’ll understand:
- Why improving data quality improves results and performance
- The many ways in which data can become corrupt
- Why the type of data affects data cleaning
- Why derived data can be better than the original
And you’ll be able to:
- Determine when and how to clean data
- Spot different types of corruption
- Transform the data to produce better representations of the original
- Clean all types of data: categorical, continuous, time series, etc.
This training course is for you because...
- You're an engineer who has to clean and improve data to remove anomalies (such as for monitoring purposes).
- You're a data scientist who has to clean and improve data to make solutions more robust, more performant, and simpler.
Prerequisites
- Familiarity with Python
- A working knowledge of basic statistics
Recommended preparation and follow-up
- Watch Introduction to Python (video, 3h 28m)
- Watch Intermediate Python (video, 2h 56m)
- Watch Hands-On Machine Learning with Python (video, 2h 39m)
- Watch Machine Learning with Python (video, 5h 17m)
- Explore the course companion website
- Watch Deploying Spark ML Pipelines in Production on AWS (video, 23m)
- Watch An Introduction to Machine Learning Models in Production (video, 39m)
- Explore Building Machine Learning Pipelines Using Spark, Docker, and AWS (Learning Path, 2h 38m)
- Watch Apache Kafka Series: Kafka Streams for Data Processing (video, 4h 46m)
- Read An Introduction to Apache Flink (book)
- Watch Deploying Machine Learning Models as Microservices Using Docker (video, 24m)
About your instructor
Dr. Philip Winder is a multidisciplinary engineer who creates data-driven software products. His work combines data science, cloud native, and traditional software development using a range of languages and tools.
Phil is the CEO of Winder, a data science consultancy in the UK that operates throughout Europe, delivering training, development, and consultancy services. He has a Ph.D. and a master's degree in electronics from the University of Hull, UK.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction (10 minutes)
- Lecture: How bad data affects results; why bad data affects results; examples of bad data; data availability; data consistency; data leakage
- Group discussion: How has bad data affected your projects?
- Hands-on exercise: Analyze data visualizations and attempt to spot all the different types of bad data
Types of data (20 minutes)
- Lecture: Types of data
- Hands-on exercises: Explore data provided in a notebook
Corrupted data (20 minutes)
- Lecture: Fixing missing data; improving noisy data; detecting anomalies; fixing inconsistencies
- Hands-on exercises: Fix missing data in a notebook; improve noisy data in a notebook; detect and remove outliers
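As an illustration of two of the exercises above, here is a minimal sketch of median imputation and outlier removal. It is not taken from the course notebook: the function names, the sample readings, and the 1.5 x IQR rule are assumptions chosen for the example, and it uses only the Python standard library.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Drop values lying more than k * IQR outside the quartiles.

    The k=1.5 default is a common rule of thumb, assumed here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sensor readings with two gaps and one obvious spike.
readings = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0, None, 11.0]
filled = impute_median(readings)
cleaned = remove_outliers_iqr(filled)
print(cleaned)
```

The median is used rather than the mean because the spike would otherwise pull the imputed value upward before the outlier is removed.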
Break (10 minutes)
Transforming data (40 minutes)
- Lecture: Statistical distributions; non-normal data; why normality is important; returning data to normality; model- and domain-based transformations; arbitrary transformations
- Group discussion: Why is normality important? What transformations exist in your domain?
- Hands-on exercises: Establish whether the data in the notebook is normally distributed; return non-normal data to normality; transform example data in the notebook
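One way to picture "returning data to normality" is a log transform applied to right-skewed data. The sketch below is an assumption about what such an exercise might look like, not the course's own material: the `skewness` helper and the sample values are hypothetical, and skewness is only one rough indicator of non-normality.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical right-skewed sample, e.g. response times in milliseconds.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# A log transform compresses the long right tail (values must be positive).
logged = [math.log(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The transformed sample is still not perfectly symmetric, but its skewness is noticeably closer to zero than that of the raw values.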
Break (10 minutes)
Working with scales (10 minutes)
- Lecture: Why scale is important; altering the scale of numerical values; handling categorical data
- Hands-on exercise: Alter the scale of example data and work with categorical data in the notebook
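The ideas in this section can be sketched in a few lines of plain Python. The helpers below (min-max scaling, standardization, and one-hot encoding) are illustrative assumptions rather than the course's code, and the function names are hypothetical.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1].

    Assumes the values are not all identical (otherwise hi == lo).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector, one column per sorted category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


print(min_max_scale([2, 4, 6]))
print(standardize([2, 4, 6]))
print(one_hot(["red", "blue", "red"]))
```

One-hot encoding avoids implying an order between categories, which a simple integer code such as blue=0, red=1 would do.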
Derived variables (10 minutes)
- Lecture: Why new data can be better than the original; domain-specific feature generation; brute-force feature generation
- Group discussion: Can feature extraction ever be automatic?
Feature selection (20 minutes)
- Lecture: Why select features; how to select features
- Hands-on exercise: Improve the performance of a model through feature generation and selection in the notebook
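To make feature generation and selection concrete, the following sketch derives a product feature and ranks all candidates by absolute Pearson correlation with the target. The data, the feature names, and the choice of correlation as a selection score are assumptions made for this example; other selection criteria are equally valid.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: the target is roughly width * height plus some noise.
width = [1, 2, 3, 4, 5, 6]
height = [6, 1, 5, 2, 4, 3]
target = [7, 2, 14, 9, 20, 17]

# Feature generation: add a derived feature alongside the originals.
features = {
    "width": width,
    "height": height,
    "width_x_height": [w * h for w, h in zip(width, height)],
}

# Feature selection: score each feature and keep the best one.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:1]
print(selected)
```

Here the derived feature scores highest because it captures an interaction that neither original column expresses on its own.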
Series variables (20 minutes)
- Lecture: How ordered data differs; applying preprocessing techniques to series data
- Group discussion: How can time series data be corrupted?
- Hands-on exercises: Clean an example time series in the notebook
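Because series data is ordered, gaps and spikes can be repaired using neighbouring points. The sketch below shows one possible approach, linear interpolation followed by a rolling median; the function names, window size, and sample series are assumptions for illustration only.

```python
import statistics


def interpolate_gaps(series):
    """Fill interior None values by linear interpolation between neighbours.

    Leading or trailing None values are left untouched.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def rolling_median(series, window=3):
    """Smooth with a centred rolling median; edges use a truncated window."""
    half = window // 2
    return [
        statistics.median(series[max(0, i - half): i + half + 1])
        for i in range(len(series))
    ]


# Hypothetical time series with a two-point gap and a single spike.
raw = [1.0, 2.0, None, None, 5.0, 50.0, 7.0, 8.0]
filled = interpolate_gaps(raw)
smoothed = rolling_median(filled)
print(filled)
print(smoothed)
```

A rolling median suppresses isolated spikes without being dragged toward them, which a rolling mean would be.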
Wrap-up and Q&A (10 minutes)
- Lecture: Related techniques; dimensionality reduction (aggregation, MDS, ICA, PCA, t-SNE, etc.); data integration