Live Online Training

Advanced Machine Learning: How to Effectively Work with Imbalanced Data

Topic: Data
Noureddin Sadawi

One of the most common problems in machine learning (ML) is that real-world data is rarely as clean and tidy as the data typically used in training courses. Real-world data can be noisy (i.e. contain irrelevant examples), can have missing values, can be imbalanced, and so on. An imbalanced dataset is a dataset with more than one class in which the number of instances (or examples) per class is not approximately equal. A model trained on such data might yield 90% accuracy while in reality predicting the same class for the vast majority (if not all) of the test data. Most machine learning algorithms used for classification were designed on the assumption that each class has roughly the same number of instances. Imbalanced data therefore poses a challenge for such models and can cause many issues, including overfitting (which happens when a model fits its training data too well but fails to perform well on new, unseen data).
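The gap between accuracy and real usefulness can be seen with a few lines of plain Python. The sketch below is only an illustration: the 90/10 class split and the "always predict the majority class" model are assumptions chosen to mirror the 90% figure above, not material from the course.

```python
# Hypothetical test labels: 90 negative (0) and 10 positive (1) examples.
y_true = [0] * 90 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print(f"accuracy        = {accuracy:.2f}")         # 0.90
print(f"minority recall = {minority_recall:.2f}")  # 0.00
```

The model scores 90% accuracy without ever identifying a single minority-class example, which is why accuracy alone can be misleading on imbalanced data.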

In this course, you will learn how to deal with the real-life challenge of imbalanced classification. The course shows how to develop a practical intuition for imbalanced classification datasets and how to overcome this problem using one method or a combination of several. In more detail, the course will show you how to modify imbalanced datasets and transform them into balanced datasets. You will also learn how to use and tune specific classifiers designed to deal with imbalanced datasets and how to choose the right metrics to quantify performance in an imbalanced classification scenario. You will be given access to many professionally written Python code examples that you can use in your own projects.

This course focuses on a very common real-life machine learning problem: imbalanced classification. The topic is often omitted from machine learning training courses, where the data tends to be balanced and easy to deal with. For example, when dealing with medical data, it is common to have many more people who do not have a disease of interest (negative examples) than people who have the disease (positive examples).

In this course you will learn about the many techniques to use when you encounter an imbalanced classification problem. You will gain immense knowledge that will empower you as a data scientist and machine learning specialist.

When you’re not specifically paying attention to class imbalance, your models can have poor predictive performance. This is normally highlighted when the performance of the model on the minority class is analyzed. This causes a problem in real life because, more often than not, the minority class is the class of interest (i.e. it is more important than the majority class as classification errors in the minority class are highly costly when compared to classification errors in the majority class).

What you'll learn and how you can apply it

  • Develop an understanding of the challenge of imbalanced classification
  • Develop an understanding and intuition for imbalanced datasets
  • Become familiar with various classification metrics that can help you evaluate the performance of the classifiers you are using
  • Become an expert in evaluating imbalanced classification models
  • Use various techniques to update the imbalanced dataset and transform it into balanced data (e.g. understand and apply undersampling and oversampling techniques)
  • Develop an understanding of how cost-sensitive classifiers work and become proficient in using them correctly
  • Have a collection of professionally written Python code examples that you can use and adapt in your own projects
  • Apply the above techniques to interesting and challenging situations

This training course is for you because...

  • You are familiar with Python and machine/deep learning and want to use them to build fully functioning real-life applications
  • You would like to learn how to deal with the common problem of imbalanced classification
  • You would like to become familiar and comfortable with selecting the correct classification metrics to evaluate the performance of the classifiers you are using (especially when your dataset is imbalanced)
  • You would like to become an expert in using various techniques to update the imbalanced dataset and transform it into balanced data
  • You would like to understand how cost-sensitive classifiers work and become proficient in using them correctly
  • You would like to have a great collection of Python functions and scripts that can help you do all of the above in any of your projects

Prerequisites

  • Familiarity with Python and machine learning. Students should be relatively comfortable with Python coding practices and how ML algorithms work.
  • [Desirable] Familiarity with basic machine learning in Python

Course Set-up

  • Any operating system is fine
  • Python 3.5 or above (Anaconda distribution)
  • Speedy internet connection

About your instructor

  • Dr. Noureddin Sadawi is a consultant in machine learning and data science. He has several years’ experience in various areas involving data manipulation and analysis. He received his PhD from the University of Birmingham, United Kingdom. During his PhD he developed a technique to extract precise information from bitmap images of chemical structure diagrams. He developed a tool called MolRec and used it to participate in evaluation contests at two international events - TREC2011 and CLEF2012 - and won both of them.

    Noureddin is an avid scientific software researcher and developer who has a passion for learning and teaching new technologies. He has been involved in several projects spanning a variety of fields such as bioinformatics, drug discovery, omics data analysis and much more. He has taught at multiple universities in the UK and has worked as a software engineer in different roles. One of his latest positions was as a research associate at the highly respected Imperial College London, where he contributed significantly to the PhenoMeNal project (a project that heavily uses Docker). Currently, he is a research fellow at the Department of Computer Science, Brunel University London, where he develops deep learning techniques for the analysis of human gesture data.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Part 1: Foundations and Model Evaluation (60 minutes)

  • What is Imbalanced Classification
  • Intuition for Imbalanced Classification
  • Why Imbalanced Classification is a Challenge
  • Model Evaluation Metrics
  • Accuracy and why it is not always suitable
  • An overview of: Precision, Recall, and F-Measure
  • An overview of: ROC Curves and Precision-Recall Curves
  • An overview of: Probability Scoring Methods
  • What is Cross-Validation? And Can it be used for Imbalanced Datasets?
  • Live demonstration of the above techniques
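As a rough illustration of the precision, recall, and F-measure topics listed above, the following sketch computes them from scratch. The helper names (`confusion_counts`, `precision_recall_f1`) and the toy predictions are hypothetical and are not taken from the course materials.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the chosen positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data: 90 negatives and 10 positives. The classifier raises
# 2 false alarms among the negatives and finds 6 of the 10 positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [1] * 6 + [0] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy here is 94%, yet 4 of the 10 minority examples are missed; precision and recall for the minority class make that visible.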

Q&A (10 minutes)

Break (10 minutes)

Part 2: Data Sampling Methods (60 minutes)

  • Data Sampling Methods
  • Random Data Sampling
  • Oversampling Methods
  • Undersampling Methods
  • Combining Oversampling and Undersampling
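A minimal sketch of random oversampling and random undersampling, using only the standard library, might look like the following. The function names and the toy dataset are hypothetical; the course may well use a dedicated library instead.

```python
import random
from collections import Counter


def group_by_class(X, y):
    """Group feature rows by their class label."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return groups


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical dataset: 90 majority rows (label 0) and 10 minority rows (label 1).
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # both classes have 90 rows
print(Counter(y_under))  # both classes have 10 rows
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.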

Q&A (10 minutes)

Break (10 minutes)

Part 3: Cost-Sensitive Classification (60 minutes)

  • Cost-Sensitive Learning
  • Cost-Sensitive Logistic Regression
  • Cost-Sensitive Decision Trees
  • Cost-Sensitive Support Vector Machines
  • Cost-Sensitive Deep Learning in Keras
  • Cost-Sensitive Gradient Boosting with XGBoost
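One common way to make a classifier cost-sensitive is to weight each example's loss by the inverse frequency of its class. The sketch below illustrates that idea only; the function names are hypothetical, and the weighting formula is the widely used "balanced" heuristic, n_samples / (n_classes * class_count), rather than anything confirmed by the course outline.

```python
import math
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def weighted_log_loss(y_true, p_positive, weights, eps=1e-12):
    """Binary log loss in which each example is scaled by its class weight."""
    total = 0.0
    weight_sum = 0.0
    for t, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -math.log(p) if t == 1 else -math.log(1 - p)
        total += weights[t] * loss
        weight_sum += weights[t]
    return total / weight_sum


# Hypothetical data: 90 negatives, 10 positives, and a model that assigns
# every example the same low probability (0.1) of being positive.
y = [0] * 90 + [1] * 10
p = [0.1] * len(y)

weights = balanced_class_weights(y)
unweighted = weighted_log_loss(y, p, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, p, weights)

print(weights)
print(f"unweighted loss = {unweighted:.3f}")
print(f"weighted loss   = {weighted:.3f}")
```

With class weights, the same predictions incur a noticeably higher loss, because mistakes on the rare positive class now count as much in total as those on the majority class.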

Q&A (10 minutes)