Anomaly Detection: Cleaning Data for Effective Data Science

Topic: Data
David Mertz, Ph.D.

This course teaches you how to detect characteristic problems in the data sets you work with, using statistical tests and an examination of bounds on reasonable values.

In this webinar on anomaly detection you will learn how to detect those cases where data goes bad in the course of its collection, collation, transmission, and transcription. Perhaps an instrument gives a bad reading some or all of the time. Perhaps some values are systematically altered in the course of reencoding to a different data format. Perhaps the wrong units of measure were used for a subset of the data. You will learn to detect the patterns that exist in the data itself as a result of recording errors.
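As an illustration of the unit-of-measure problem described above, here is a minimal sketch using only the Python standard library. The readings, the plausibility bounds, and the helper names (`classify`, `fahrenheit_to_celsius`) are all hypothetical and are not taken from the course materials. The sketch flags values that fall outside assumed bounds, and marks those that become plausible after a Fahrenheit-to-Celsius conversion.

```python
# Hypothetical example data: body temperatures meant to be in degrees Celsius.
readings = [36.6, 37.1, 98.6, 36.9, 99.1, -1.0, 37.4]

# Assumed plausibility bounds in Celsius (illustrative only, not clinical advice).
LOW_C, HIGH_C = 30.0, 45.0


def fahrenheit_to_celsius(value):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def classify(value):
    """Label one reading as 'ok', 'likely_fahrenheit', or 'out_of_bounds'."""
    if LOW_C <= value <= HIGH_C:
        return "ok"
    # A value outside the Celsius bounds that lands inside them after
    # conversion is a hint (not proof) that the wrong unit was recorded.
    if LOW_C <= fahrenheit_to_celsius(value) <= HIGH_C:
        return "likely_fahrenheit"
    return "out_of_bounds"


labels = [classify(v) for v in readings]
for value, label in zip(readings, labels):
    print(f"{value:6.1f}  {label}")
```

A check like this only suggests a cause. Whether to convert, drop, or keep a flagged value remains a judgment call that depends on the domain.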

Cleaning Data for Effective Data Science is a series of live online trainings, and soon a book, that address the central skills required of all data scientists, data analysts, and other developers in data-oriented domains.

Everything a data scientist needs to do before what is usually called data science is covered across a broad range of topics:

  • Data Ingestion of Tabular Formats
  • Value Imputation
  • Anomaly Detection
  • Feature Engineering
  • Data Ingestion of Hierarchical and Other Data Formats

What you'll learn and how you can apply it

  • How to detect and process missing data, which is sometimes marked in surprising ways
  • How to identify coding errors in categorical and string data
  • How to find outliers in numeric data
  • How to find outliers in the interaction or combination of multiple numeric features
  • How to improve your downstream data analysis and modeling
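Two of the items in the list above, sentinel-marked missing data and outliers in a single numeric feature, can be sketched roughly as follows. This is an illustrative example using only the Python standard library, not the course's own material. The sentinel set, the sample column, and the `parse` helper are hypothetical, and Tukey's 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import statistics

# Hypothetical sentinel markers; real data sets may use many others.
SENTINELS = {"", "NA", "N/A", "null", "-999", "9999"}

# Hypothetical raw column as it might arrive from a text file.
raw = ["12.1", "11.8", "-999", "12.4", "NA", "11.9", "58.0", "12.0", "", "12.3"]


def parse(cell):
    """Return a float, or None when the cell holds a sentinel marker."""
    text = cell.strip()
    if text in SENTINELS:
        return None
    return float(text)


parsed = [parse(cell) for cell in raw]
missing_positions = [i for i, v in enumerate(parsed) if v is None]
present = [v for v in parsed if v is not None]

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, _median, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]

print("missing at positions:", missing_positions)
print("outliers:", outliers)
```

Note the order of operations: if the `-999` sentinel were parsed as a number, it would itself look like an extreme outlier and would distort the quartiles. With pandas, the same ideas would typically be expressed through the `na_values` argument of `read_csv` and `Series.quantile`.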

This training course is for you because...

  • You need to do the actual work of a data scientist: all the many hours that come before the theory of machine learning, statistics, and data visualization.


Prerequisites:

  • Students in this course should have a comfortable intermediate-level knowledge of Python and/or R, and a very basic familiarity with statistics.

Course Set-up:

Students should have a system with Jupyter Notebooks installed; a recent version of scikit-learn, along with Pandas, NumPy, and matplotlib; and the general scientific Python tool stack. The training materials will be made available as notebooks at a GitHub repository. Optionally, installing R will allow students to follow some additional examples.

About your instructor

  • David Mertz is a data scientist, trainer, and erstwhile startup CTO who is currently writing the Addison-Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped create the world's fastest (and highly specialized) supercomputer for performing molecular dynamics.


The timeframes are only estimates and may vary depending on how the class progresses.

Segment 1: Introduction (15 minutes)

Segment 2: Missing data and sentinels (45 minutes)

Break (10 minutes)

Segment 3: Miscoded data (30 minutes)

Segment 4: Fixed bounds (30 minutes)

Break (10 minutes)

Segment 5: Outliers (45 minutes)

Break (10 minutes)

Segment 6: Multivariate outliers (30 minutes)

Segment 7: Exercises (15 minutes)

Course wrap-up and next steps