Live Online training

Debugging Data Science

Hands-on applied machine learning with Python

Jonathan Dinu

Data science is more widespread now than ever due to the rise of powerful open source tools, the ubiquity of accessible learning materials, and the availability of meaningful data. While these resources lower the barrier to getting started with machine learning, the transition from the well-behaved datasets and tasks of introductory tutorials to real-world machine learning, with its messy, ill-defined problems, is often left as an exercise for the reader.

This training provides a hands-on guide to applying machine learning in the wild. Through an end-to-end data science example, we will walk through the process of defining an appropriate problem, building and evaluating a model, and then improving its performance through a variety of more advanced techniques. The focus will be on debugging machine learning problems that arise during the model training process and seeing how to overcome these issues to improve the effectiveness of the model.

What you'll learn and how you can apply it

  • Use scikit-learn to build machine learning models and evaluate them using advanced metrics to diagnose learning problems.
  • Improve the performance of a machine learning model through feature selection, data augmentation, and hyperparameter optimization.
  • Walk through an end-to-end applied machine learning problem, using cost-sensitive learning to optimize “profit.”
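
As a rough illustration of the last item, cost-sensitive learning can be as simple as choosing a decision threshold that maximizes profit rather than accuracy. The sketch below is not taken from the course materials: the profit values, the function names `total_profit` and `best_threshold`, and the toy labels and scores are all hypothetical, and it uses plain Python rather than scikit-learn to stay self-contained.

```python
# Hypothetical profit for each outcome (assumed values, for illustration only):
# a true positive earns 100, a false positive costs 20, the rest are neutral.
PROFIT = {"tp": 100.0, "fp": -20.0, "fn": 0.0, "tn": 0.0}


def total_profit(y_true, scores, threshold, profit=PROFIT):
    """Sum the profit of every prediction made at the given threshold."""
    total = 0.0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            total += profit["tp"]
        elif predicted_positive and label == 0:
            total += profit["fp"]
        elif label == 1:
            total += profit["fn"]
        else:
            total += profit["tn"]
    return total


def best_threshold(y_true, scores, profit=PROFIT):
    """Try each distinct score as a threshold and keep the most profitable."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: total_profit(y_true, scores, t, profit))


# Toy labels and model scores (made up for this example).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.80, 0.90]

threshold = best_threshold(y_true, scores)
print(threshold, total_profit(y_true, scores, threshold))
```

With these assumed payoffs, a missed positive is far more expensive than a false alarm, so the most profitable threshold sits below the conventional 0.5 cutoff. In practice the thresholds would be compared on held-out data rather than on the data used to fit the model.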

This training course is for you because...

  • You have taken an introductory machine learning or data science course but want a “second course” in machine learning to understand how to effectively apply the theory to real-world problems and troubleshoot issues that might arise.
  • You are an aspiring data scientist looking to break into the field and need to learn the practical skills necessary for what you will encounter on the job.
  • You are a quantitative researcher interested in applying theory to real projects by taking a computational approach to modeling.
  • You are a software engineer interested in building intelligent applications driven by machine learning.


Prerequisites

  • Experience with an object-oriented programming language, e.g., Python (all code demos during the training will be in Python).
  • Familiarity with the basics of supervised machine learning.
  • A working knowledge of the scientific Python libraries (pandas and scikit-learn) is helpful but not required.

Course Set-up

Recommended Preparation

Recommended Follow-up

About your instructor

  • Jonathan Dinu is currently pursuing a Ph.D. in Computer Science at Carnegie Mellon’s Human-Computer Interaction Institute (HCII), where he is working to democratize machine learning and artificial intelligence through interpretable and interactive algorithms. Previously, he co-founded Zipfian Academy (an immersive data science training program acquired by Galvanize), taught classes at the University of San Francisco, and built a Data Visualization MOOC with Udacity.

    In addition to his professional data science experience, he has run data science trainings for a Fortune 100 company and taught workshops at Strata, PyData, and DataWeek (among others). He first discovered his love of all things data while studying Computer Science and Physics at UC Berkeley, and in a former life he worked for Alpine Data Labs developing distributed machine learning algorithms for predictive analytics on Hadoop.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Specifying Learnable Problems (25 min)

  • Exploratory Data Analysis with pandas
  • Data Validation
  • Quantifying Success

Q&A (5 min)

Building and Evaluating Models (50 min)

  • Supervised Learning Review
  • Introduction to Cross Validation
  • Interpreting your Model
  • Advanced Evaluation Metrics

Break (10 min)

Improving Machine Learning Models (55 min)

  • Troubleshooting Imbalanced Classes
  • Model Selection and Hyperparameter Optimization
  • Feature Selection and Engineering
  • Cost-sensitive Learning

Q&A (5 min)

Machine Learning in the Wild (25 min)

  • Evaluating Deployed Models
  • Data Augmentation and Crowdsourcing

Q&A (5 min)