Live Online Training

Data Processing Essentials for Building Predictive Models with Python: Performing feature selection and dimensionality reduction


Topic: Data
Janani Ravi

Two realities of ML and AI use in enterprises today are that virtually everyone is using the same data, and that everyone is trying to run the same models on that data. The nature of data collection and automation across enterprises has been standardized, so there’s little differentiation in the quantity of data available to modelers. Meanwhile, the democratization of model building means that cloud platforms, ML frameworks, and Python libraries are all available to everyone—and, as a result, are being used by everyone in similar ways. If the data you use to build your model isn’t really a differentiator, and the kind of model you build isn’t either, then the key differentiator is the manner in which you prepare your data.

In this course, the third in a three-part series on data handling and feature engineering, expert Janani Ravi takes you through solutions to the curse of dimensionality. As data has grown more complex, larger numbers of dimensions are needed to represent it. (For instance, a corpus of images is usually represented as a hard-to-visualize 4D tensor.) Models built on such high-dimensional data are not only hard to interpret; they also suffer from problems in both training and performance. You'll learn how to use techniques such as principal components analysis (PCA), manifold learning, and autoencoders to mitigate this challenge, and discover when each one is appropriate.
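To give a flavor of what dimensionality reduction looks like in practice, here is a minimal sketch (illustrative only, not course material) using scikit-learn's PCA on its bundled digits dataset; the dataset and the choice of two components are assumptions made for the demonstration.

```python
# Minimal sketch: reducing 64-dimensional digit images to 2 principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # 1,797 samples x 64 features
pca = PCA(n_components=2)             # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)           # project every sample onto those directions

print(X_2d.shape)                     # (1797, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component captures
```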

The Data Quality Series is a set of three live online training courses, meant to be followed in this order (although each is a standalone course):

  1. Data Cleaning Essentials for Building Predictive Models with Python (Data Quality Series)
  2. Data Prep Essentials for Building Predictive Models with Python (Data Quality Series)
  3. Data Processing Essentials for Building Predictive Models with Python (Data Quality Series)

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The problems encountered when you work with high-dimensional data
  • The differences between feature selection and dimensionality reduction techniques
  • Techniques to perform feature selection and dimensionality reduction on your data

And you’ll be able to:

  • Perform feature selection for ML models using statistical techniques
  • Perform dimensionality reduction on linear data using techniques such as principal components analysis
  • Perform dimensionality reduction on complex nonlinear data using manifold learning techniques

This training course is for you because...

  • You’re a business analyst who needs to make sense of large quantities of data of uncertain provenance and quality.
  • You’re a data scientist who wants to understand how to use the right data.
  • You’re a data engineer who’s noticed that a model that worked fine in testing isn’t working quite as well in practice.

Prerequisites

  • A working knowledge of Python and the Jupyter Notebook
  • A basic understanding of building and training ML models
  • Familiarity with regression and classification techniques in ML


About your instructor

  • Janani Ravi is a cofounder of Loonycorn, a team dedicated to upskilling IT professionals. She has been involved in creating more than 75 online courses on Azure and GCP. Previously, Janani worked at Google, Flipkart, and Microsoft. She completed her studies at Stanford.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Feature selection techniques (55 minutes)

  • Presentation: The curse of dimensionality; problems with overfitted models; feature selection and dimensionality reduction for relevant data; filter, wrapper, and embedded methods of feature selection
  • Jupyter Notebook exercise: Visualize feature correlations and apply statistical techniques to choose relevant features for model building
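
For a sense of the filter-method workflow described above, here is a minimal sketch assuming scikit-learn, pandas, and the bundled breast cancer dataset; the dataset, the ANOVA F-test scorer, and k=10 are illustrative choices, not the course's exercise.

```python
# Illustrative sketch of a filter-method feature selection workflow.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Inspect pairwise feature correlations (highly correlated features are redundant).
corr = X.corr()

# Keep the 10 features with the strongest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()].tolist())
```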

Break (5 minutes)

Dimensionality reduction techniques (55 minutes)

  • Presentation: Dimensionality reduction for linear and nonlinear data; problems with reducing complexity
  • Jupyter Notebook exercises: Perform principal components analysis, linear discriminant analysis, and quadratic discriminant analysis for dimensionality reduction; use manifold learning techniques to unroll data; work with real-world data to perform PCA; apply manifold learning techniques to unroll complex nonlinear data
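
As one illustration of the manifold learning step, the sketch below unrolls the classic synthetic Swiss roll with Isomap; the dataset, neighbor count, and the choice of Isomap (rather than another manifold method) are assumptions for demonstration only.

```python
# Illustrative sketch: "unrolling" the Swiss roll with a manifold learning technique.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3D points lying on a rolled-up 2D sheet (nonlinear structure PCA cannot flatten).
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)  # preserve geodesic distances along the sheet
X_unrolled = iso.fit_transform(X)             # (1500, 2): the roll laid out flat

print(X_unrolled.shape)
```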

Wrap-up and Q&A (5 minutes)