O'Reilly Live Online Training

Debugging Data Science, Part 2: Tuning Models, Engineering Features, and Improving Performance

Jonathan Dinu

Data science is more widespread now than ever due to the emergence of powerful open source tools, the ubiquity of accessible learning materials, and the availability of valuable data. While these resources lower the barrier to getting started with machine learning, making the transition from the well-behaved datasets and tasks presented in introductory tutorials to the ill-defined problems of real-world machine learning is often a challenging "exercise left to the reader".

This training provides an invaluable, hands-on guide to applying machine learning in the wild. Using an end-to-end data science example, we will walk through the process of defining an appropriate problem, building and evaluating a machine learning model, and taking its performance to the next level with a variety of advanced techniques. The focus of this course, Debugging Data Science, Part 2, is on methods to tune machine learning models for maximum performance and on data augmentation strategies to get the most out of the data you do have. Part 2 wraps up with a section on advanced techniques for efficiently searching large hyperparameter spaces and on cutting-edge strategies for intelligently acquiring more labeled data to optimize model performance.

What you'll learn and how you can apply it

  • Understand how to tune machine learning parameters and ensemble models to maximize predictive power.
  • Use feature engineering and active learning to improve models through data.
  • Learn how to use advanced techniques to optimize model hyperparameters efficiently and automatically augment input data for increased performance.

This training course is for you because...

  • You have taken an introductory data science course but want a second course to better understand how to apply machine learning to real-world problems and troubleshoot issues that might arise.
  • You are an aspiring data scientist looking to break into the field and need to learn the practical skills necessary for what you will encounter on the job.
  • You are a quantitative researcher interested in applying theory to real projects by taking a computational approach to modeling.
  • You are a software engineer interested in building intelligent applications driven by machine learning.

Prerequisites

  • Experience with an object-oriented programming language, e.g., Python (all code demos during the training will be in Python)
  • Familiarity with the basics of supervised machine learning (e.g. logistic regression) is helpful but not required.
  • A working knowledge of the scientific Python libraries (numpy, pandas and scikit-learn) is helpful but not required.


About your instructor

  • Jonathan Dinu is currently pursuing a Ph.D. in Computer Science at Carnegie Mellon’s Human Computer Interaction Institute (HCII) where he is working to democratize machine learning and artificial intelligence through interpretable and interactive algorithms. Previously, he co-founded Zipfian Academy (an immersive data science training program acquired by Galvanize), has taught classes at the University of San Francisco, and has built a Data Visualization MOOC with Udacity.

    In addition to his professional data science experience, he has run data science trainings for a Fortune 100 company and taught workshops at Strata, PyData, & DataWeek (among others). He first discovered his love of all things data while studying Computer Science and Physics at UC Berkeley and in a former life he worked for Alpine Data Labs developing distributed machine learning algorithms for predictive analytics on Hadoop.

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Tuning Models and Hyperparameter Optimization (55 min)

  • Understanding the bias-variance trade-off
  • Model Regularization
  • Brute-force Hyperparameter Optimization
  • Model Averaging and Ensembles
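To give a flavor of the brute-force approach covered in this section, here is a minimal sketch (not the course's own code) of tuning a regularized model with scikit-learn's `GridSearchCV`; the dataset and parameter grid are illustrative assumptions.

```python
# Illustrative sketch: brute-force hyperparameter search over a small grid,
# using cross-validation to score each candidate. Not course material.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# C controls the inverse strength of L2 regularization (smaller C = stronger).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the C value with the best cross-validated score
```

Grid search scales poorly as the number of parameters grows, which is exactly the motivation for the distributed and Bayesian approaches in the Advanced Techniques section.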

Q&A (5 min)

Improving Performance through Data (50 min)

  • Learning Curves Review
  • Troubleshooting Imbalanced Classes
  • Feature Selection Metrics and Filter Methods
  • Model Based Automatic Feature Selection

Break (10 min)

Advanced Techniques (55 min)

  • Distributed Hyperparameter Optimization
  • Bayesian Hyperparameter Optimization
  • Active Learning and Weak Supervision
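To illustrate the active learning topic above, here is a minimal sketch of pool-based uncertainty sampling (an assumed setup, not the course's code): train on a small labeled set, then query the unlabeled points the model is least confident about.

```python
# Illustrative sketch: active learning via least-confidence uncertainty
# sampling over an unlabeled pool. Not course material.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Pretend only 10 examples per class are labeled; the rest form the pool.
labeled = np.concatenate([np.where(y == 0)[0][:10], np.where(y == 1)[0][:10]])
pool = np.setdiff1d(np.arange(len(y)), labeled)

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# Least-confidence score: 1 minus the top predicted class probability.
probs = model.predict_proba(X[pool])
uncertainty = 1 - probs.max(axis=1)

# Query the 10 most uncertain pool points for labeling by an oracle.
query = pool[np.argsort(uncertainty)[-10:]]
print(query)
```

In practice the queried points are sent to a human annotator (or a weak-supervision source), added to the labeled set, and the loop repeats, spending labeling budget where it helps the model most.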

Q&A (5 min)