Live Online Training

Visualizing High-Dimensional Data with Python

Learn how to use dimensionality reduction to better understand your data


Topic: Data
Jeroen Janssens

Understanding your data is key in any data science project. Visualization helps, but it becomes challenging when the data is high-dimensional. This includes complex data types such as text, images, and sensor measurements from fields such as industry, healthcare, and transportation. You could create a scatter plot matrix, but it can only show how any two features interact and fails to capture structure that spans many dimensions. Fortunately, an entire subfield of machine learning is concerned with exactly this challenge: dimensionality reduction. Dimensionality reduction algorithms can help you gain insight into your high-dimensional data and reveal whether it has any structure.
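The scatter plot matrix limitation can be seen in a minimal sketch. This is purely illustrative: pandas and the classic iris dataset are assumptions here, not named course material.

```python
# Sketch: a scatter plot matrix shows only pairwise feature interactions.
# The iris dataset (4 features) is used purely as a small illustration.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")  # keep the 4 numeric features

# One panel per feature pair: a 4x4 grid for 4 features.
axes = pd.plotting.scatter_matrix(df, figsize=(8, 8))

# With d features there are d*(d-1)/2 distinct pairs; for 64 features
# that is already 2,016 pairwise panels, and no single panel can show
# structure that spans more than two dimensions at once.
print(axes.shape)
```

This is why the grid grows quadratically with the number of features, which motivates reducing dimensionality before plotting.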

Expert Jeroen Janssens walks you through three well-known dimensionality reduction algorithms: principal component analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Join in to learn when and why to use dimensionality reduction, the benefits and limitations of the various algorithms, how they work under the hood, and how to apply them using Python and the Jupyter Notebook.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The importance of visualizing high-dimensional data
  • The benefits and limitations of various dimensionality reduction algorithms, including principal component analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP)
  • The inner workings of these algorithms at a more detailed level

And you’ll be able to:

  • Apply dimensionality reduction algorithms in Python and the Jupyter Notebook
  • Choose the right parameter settings given an algorithm and dataset
  • Visualize the resulting mapping using your favorite plotting package, whether that's Matplotlib, Altair, seaborn, or plotnine

This training course is for you because...

  • You’re a data scientist, BI specialist, statistician, or machine learning engineer who works with complex data.
  • You want to understand how dimensionality reduction works and how it can help you.
  • You aim to reveal any structure in your data through visualization.

Prerequisites

  • A working knowledge of Python
  • Familiarity with the scikit-learn API (useful but not required)


About your instructor

  • Jeroen Janssens is an instructor at Data Science Workshops, which organizes open-enrollment workshops, in-company courses, inspiration sessions, hackathons, and meetups, all related to data science. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and various startups in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

The importance of dimensionality reduction (20 minutes)

  • Group discussion: What kind of data do you work with? How do you currently visualize it?
  • Presentation: What dimensionality reduction is; why to use it; an overview of dimensionality reduction algorithms
  • Demo: The disadvantage of a scatter plot matrix
  • Q&A

Algorithm: PCA (35 minutes)

  • Presentation: An intuitive understanding of PCA (principal component analysis)
  • Demo: Visualizing results using Matplotlib, seaborn, Altair, or plotnine
  • Jupyter Notebook exercise: Apply PCA and visualize results
  • Q&A
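As a preview of what the PCA exercise involves, here is a minimal scikit-learn sketch. The digits dataset is an assumption used for illustration, not the dataset used in the course.

```python
# Sketch: reduce 64-dimensional data to 2 dimensions with PCA.
# The digits dataset is a stand-in, not the course dataset.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()            # 1,797 samples x 64 features
pca = PCA(n_components=2)         # keep the two leading principal components
embedding = pca.fit_transform(digits.data)

print(embedding.shape)            # one (x, y) point per sample
# Fraction of variance captured by each kept component:
print(pca.explained_variance_ratio_)
```

The resulting two-column array can be passed directly to any of the plotting packages mentioned above.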

Break (5 minutes)

Algorithm: t-SNE (55 minutes)

  • Presentation: A deep dive into t-SNE (t-Distributed Stochastic Neighbor Embedding)
  • Jupyter Notebook exercise: Explore the influence of the parameter perplexity
  • Q&A
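A hedged sketch of the perplexity exercise with scikit-learn's t-SNE follows. The digits dataset and the specific perplexity values are assumptions for illustration only.

```python
# Sketch: run t-SNE at several perplexity values and compare embeddings.
# Perplexity roughly controls how many neighbors each point "attends" to.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]      # subset to keep the sketch fast

for perplexity in (5, 30, 50):    # must be smaller than the sample count
    emb = TSNE(n_components=2,
               perplexity=perplexity,
               random_state=0).fit_transform(X)
    print(perplexity, emb.shape)
```

Small perplexity values emphasize local structure; larger values emphasize more global structure, which is exactly the trade-off the exercise explores.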

Break (5 minutes)

Algorithm: UMAP (40 minutes)

  • Presentation: The difference between t-SNE and UMAP (Uniform Manifold Approximation and Projection)
  • Jupyter Notebook exercise: Apply UMAP and compare results with t-SNE

Wrap-up and Q&A (20 minutes)