Live Online Training

Debugging Data Science, Part 1: Evaluating Machine Learning in Practice

Jonathan Dinu

Data science is more widespread now than ever due to the emergence of powerful open source tools, the ubiquity of accessible learning materials, and the availability of valuable data. While these resources lower the barrier to getting started with machine learning, making the transition from the well-behaved datasets and tasks presented in introductory tutorials to the ill-defined problems of real-world machine learning is often a challenging “exercise left to the reader.”

This training provides an invaluable, hands-on guide to applying machine learning in the wild. Using an end-to-end data science example, we will walk through the process of defining an appropriate problem, building and evaluating a machine learning model, and taking its performance to the next level with a variety of advanced techniques. The focus of this course, Debugging Data Science, Part 1, is on evaluating machine learning models and troubleshooting problems that arise during model training. Part 1 wraps up with more nuanced evaluation techniques and a survey of methods for selecting the best model for a given task, which may not simply be the most accurate one.

What you'll learn and how you can apply it

  • Use scikit-learn to build machine learning models and evaluate them using advanced metrics (see the first sketch after this list).
  • Understand the importance of proper cross-validation and how to use it to diagnose model learning problems.
  • Learn the basics of model selection and how to choose the best model for a specific problem.
  • Walk through an end-to-end applied machine learning problem, applying cost-sensitive learning to optimize “profit” (the second sketch below illustrates the idea).
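
A minimal sketch of the workflow the first three bullets describe, using an assumed synthetic, imbalanced dataset: fit a logistic regression with scikit-learn, cross-validate it to check that performance is stable, and score it with a metric that is more informative than plain accuracy. The dataset, hyperparameters, and metric choice are illustrative assumptions, not course materials.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import average_precision_score

    # Synthetic, imbalanced binary classification problem (roughly 90/10 split).
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000)

    # 5-fold cross-validation on the training set: large variance across folds
    # is a first hint of over-fitting or of too little data.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                scoring="average_precision")
    print("CV average precision: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

    # Final check on a held-out test set with a metric that, unlike accuracy,
    # is sensitive to performance on the minority class.
    model.fit(X_train, y_train)
    test_scores = model.predict_proba(X_test)[:, 1]
    print("Test average precision: %.3f" % average_precision_score(y_test, test_scores))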
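
The last bullet's “profit” optimization can be illustrated with a second, equally hypothetical sketch: given a classifier's predicted probabilities and a made-up payoff for each prediction outcome, sweep decision thresholds and keep the one that maximizes total profit on a validation set. The payoff values and the toy data are invented for illustration only.

    import numpy as np

    def total_profit(y_true, y_prob, threshold, payoff):
        """payoff[actual][predicted] is the profit (or cost) of each outcome."""
        y_pred = (y_prob >= threshold).astype(int)
        return sum(payoff[actual][pred] for actual, pred in zip(y_true, y_pred))

    # Hypothetical payoffs: a true positive earns 50, a false positive costs 5,
    # a false negative costs 20, and a true negative is neutral.
    payoff = {0: {0: 0, 1: -5}, 1: {0: -20, 1: 50}}

    def best_threshold(y_true, y_prob, payoff):
        # Sweep candidate thresholds and keep the most profitable one.
        thresholds = np.linspace(0.05, 0.95, 19)
        profits = [total_profit(y_true, y_prob, t, payoff) for t in thresholds]
        return thresholds[int(np.argmax(profits))]

    # Toy usage with random scores; in practice y_prob would come from
    # predict_proba on a held-out validation set.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)
    y_prob = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, size=1000), 0, 1)
    print("Profit-maximizing threshold:", best_threshold(y_true, y_prob, payoff))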

This training course is for you because...

  • You have taken an introductory data science course but want a second course to better understand how to apply machine learning to real-world problems and troubleshoot issues that might arise.
  • You are an aspiring data scientist looking to break into the field and need to learn the practical skills necessary for what you will encounter on the job.
  • You are a quantitative researcher interested in applying theory to real projects by taking a computational approach to modeling.
  • You are a software engineer interested in building intelligent applications driven by machine learning.

Prerequisites

  • Experience with an object-oriented programming language, e.g., Python (all code demos during the training will be in Python).
  • Familiarity with the basics of supervised machine learning (e.g., logistic regression) is helpful but not required.
  • A working knowledge of the scientific Python libraries (NumPy, pandas, and scikit-learn) is helpful but not required.

Course Set-up

Recommended Preparation

About your instructor

  • Jonathan Dinu is currently pursuing a Ph.D. in Computer Science at Carnegie Mellon’s Human-Computer Interaction Institute (HCII), where he is working to democratize machine learning and artificial intelligence through interpretable and interactive algorithms. Previously, he co-founded Zipfian Academy (an immersive data science training program acquired by Galvanize), taught classes at the University of San Francisco, and built a data visualization MOOC with Udacity.

    In addition to his professional data science experience, he has run data science trainings for a Fortune 100 company and taught workshops at Strata, PyData, and DataWeek, among others. He first discovered his love of all things data while studying Computer Science and Physics at UC Berkeley, and in a former life he worked for Alpine Data Labs, developing distributed machine learning algorithms for predictive analytics on Hadoop.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to Machine Learning with scikit-learn 40 min

  • Supervised Learning with Logistic Regression (25 min)
  • Introduction to the scikit-learn API (15 min)
  • Q&A

Troubleshooting Learning Problems 50 min

  • Introduction to Cross-Validation (20 min)
  • Regularization and Model Complexity (15 min)
  • Training Convergence and Learning Curves (15 min)

Break (10 min)

Choosing the Best Model 70 min

  • Advanced Evaluation Metrics (15 min)
  • Understanding Model Calibration (10 min)
  • Model Selection Criteria (20 min)
  • Decision Theory for Cost-Sensitive Learning (25 min)

Q&A (5 min)