Live online training

Feature Engineering: Cleaning Data for Effective Data Science

Topic: Data
David Mertz, Ph.D.

This course shows you how to create more useful features out of the raw form of the data you receive, by engineering features more suited to your machine learning or data science task.

In this live online training you will learn how to work with datetime fields, including fills and interpolations, and how to normalize string fields. Beyond isolated cleanup, you will learn how to tailor the features of your data set to better serve your data science purpose. Decompositions such as principal component analysis (PCA) may accentuate the information content. Categorical data should usually be one-hot encoded, which you will learn to perform. Moreover, deterministic combinations of raw features, generated as polynomial features, often produce more useful inputs.
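As a rough illustration of the datetime and string cleanup described above, here is a minimal sketch using pandas. The series `readings` and `cities` are made-up sample data, not course material, and the specific methods chosen are only one reasonable way to approach each task.

```python
import pandas as pd

# Hypothetical readings: Jan 3 is missing entirely, and Jan 4 has no value
readings = pd.Series(
    [10.0, 12.0, None, 18.0],
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05"]
    ),
)

# Regularize to a daily index; days absent from the data become NaN
daily = readings.asfreq("D")

# Fill: carry the last known value forward
filled = daily.ffill()

# Interpolate: estimate gaps linearly, weighted by elapsed time
interpolated = daily.interpolate(method="time")

# Hypothetical string field with inconsistent case and whitespace
cities = pd.Series(["  New York", "new  york ", "NEW YORK"])

# Normalize: trim the ends, lowercase, collapse internal runs of whitespace
normalized = (
    cities.str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
)

print(filled)
print(interpolated)
print(normalized.unique())
```

With this sample data, the forward fill repeats 12.0 across the gap, while the time-based interpolation places the two missing days at roughly 14 and 16. The three spellings of the city collapse to a single value.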

Cleaning Data for Effective Data Science is a series of live online trainings, and soon a book, that addresses the central skills required of all data scientists, data analysts, and other developers in data-oriented domains.

Everything a data scientist needs to do before what is usually called data science is covered across a broad range of topics:

  • Data Ingestion of Tabular Formats
  • Value Imputation
  • Anomaly Detection
  • Feature Engineering
  • Data Ingestion of Hierarchical and Other Data Formats

What you'll learn and how you can apply it

  • How to fill and interpolate datetime series data
  • How to transform the parametric space of your data (decomposition)
  • How to one-hot encode categorical data
  • How to generate new features by polynomial combination of raw features

This training course is for you because...

  • You need to do the actual work of a data scientist: all the many hours that come before the theory of machine learning, statistics, and data visualization.


Prerequisites:

  • Students in this course should have a comfortable intermediate-level knowledge of Python and/or R, and a very basic familiarity with statistics.

Course Set-up:

Students should have a system with Jupyter Notebooks installed; a recent version of scikit-learn, along with Pandas, NumPy, and matplotlib; and the general scientific Python tool stack. The training materials will be made available as notebooks at a GitHub repository. Optionally, installing R will allow students to follow some additional examples.
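One way to check the set-up described above before class is a short script that reports which packages are installed. This is a sketch, not an official course requirement check. The list of distribution names is an assumption (for instance, `notebook` stands in for a Jupyter installation, which might instead be provided by JupyterLab on your system).

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed; adjust for your set-up)
packages = ["notebook", "scikit-learn", "pandas", "numpy", "matplotlib"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(packages)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```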

Recommended Preparation:

If you need to brush up on the prerequisites, consider the following:

Recommended Follow-up:

About your instructor

  • David Mertz is a data scientist, trainer, and erstwhile startup CTO, who is currently writing the Addison Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped with creating the world's fastest—highly-specialized—supercomputer for performing molecular dynamics.


The timeframes are only estimates and may vary according to how the class is progressing.

  • Segment 1: Introduction (15 minutes)
  • Segment 2: Date/time fields (45 minutes)

Break (15 minutes)

  • Segment 3: String fields (45 minutes)
  • Segment 4: One-hot encoding (15 minutes)

Break (15 minutes)

  • Segment 5: Decompositions (45 minutes)
  • Segment 6: Polynomial features (30 minutes)
  • Segment 7: Exercises (15 minutes)