Live Online Training

Data Prep Essentials for Building Predictive Models with Python: Processing numeric, categorical, and text data


Topic: Data
Janani Ravi

It’s probably no exaggeration to say that the explosion of AI and ML is directly linked to the ubiquity of the cameras and free-form text that dominate social media and modern-day communication. For decades, ML/AI models could only work with rigidly structured, tightly defined input features, but building highly structured, scrubbed datasets is hard, requiring significant expertise and laborious effort. As a result, ML/AI models only became hugely popular when researchers found ways for them to work seamlessly with images, text, and video.

In this course—the second in a three-part series on data handling and feature engineering—expert Janani Ravi shows you how to take features that arrive in inconvenient forms (categorical data, textual data, and the numeric representations of images) and transform them into the form ML models consume most readily: arrays of continuous numbers. (For instance, a grayscale image is represented as a two-dimensional matrix, or tensor; an RGB image needs a three-dimensional tensor; and a corpus of images is represented as a hard-to-visualize four-dimensional tensor.) In just two hours, you’ll learn how to use various specialized techniques for representing text data and compare the strengths and shortcomings of each.
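The tensor shapes mentioned above are easy to see concretely. A minimal NumPy sketch (the 28x28 image size and 100-image corpus are illustrative choices, not from the course):

```python
import numpy as np

# A single 28x28 grayscale image: a 2-D matrix of pixel intensities.
gray = np.zeros((28, 28))

# The same image in RGB adds a channel axis, giving a 3-D tensor.
rgb = np.zeros((28, 28, 3))

# A corpus of 100 RGB images stacks a batch axis on top: a 4-D tensor.
corpus = np.zeros((100, 28, 28, 3))

print(gray.ndim, rgb.ndim, corpus.ndim)  # 2 3 4
```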

The Data Quality Series is a set of three live online training courses, meant to be followed in this order (although each is a standalone course):

  1. Data Cleaning Essentials for Building Predictive Models with Python (Data Quality Series)
  2. Data Prep Essentials for Building Predictive Models with Python (Data Quality Series)
  3. Data Processing Essentials for Building Predictive Models with Python (Data Quality Series)

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The importance of preprocessing data to build predictive models
  • How to preprocess numeric data to feed into ML models
  • Techniques for preprocessing text data to feed into ML models

And you’ll be able to:

  • Use Python to perform standardization and scaling operations to process numeric data
  • Perform L1, L2, and max normalization on your data
  • Create numeric representations of text data to build and train ML models
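
The course outline doesn't name a specific library, but scikit-learn's preprocessing module is the natural fit for these operations in Python. A minimal sketch of standardization alongside L1, L2, and max normalization (the sample matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization rescales each COLUMN to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Normalization rescales each ROW by its L1, L2, or max norm.
X_l1 = Normalizer(norm="l1").fit_transform(X)    # row sums of |x| become 1
X_l2 = Normalizer(norm="l2").fit_transform(X)    # row Euclidean norms become 1
X_max = Normalizer(norm="max").fit_transform(X)  # row max |x| becomes 1
```

Note the direction of each operation: standardization works per feature (column), while the L1/L2/max norms are applied per sample (row), which matters when rows represent things like word-count vectors.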

This training course is for you because...

  • You’re a business analyst who needs to make sense of large quantities of data of uncertain provenance and quality.
  • You’re a data scientist who wants to understand how to use the right data.
  • You’re a data engineer who’s noticed that a model that worked fine in testing isn’t working quite as well in practice.


Prerequisites:

  • A working knowledge of Python and the Jupyter Notebook
  • A basic understanding of building and training ML models
  • Familiarity with regression and classification techniques in ML

Recommended preparation:

Recommended follow-up:

About your instructor

  • Janani Ravi is a cofounder of Loonycorn, a team dedicated to upskilling IT professionals. She’s been involved in more than 75 online courses in Azure and GCP. Previously, Janani worked at Google, Flipkart, and Microsoft. She completed her studies at Stanford.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Preprocessing numeric data (55 minutes)

  • Presentation: Continuous and categorical data for ML; the differences between data standardization and normalization; normalizing data using L1, L2, and max norms
  • Jupyter Notebook exercises: Standardize numeric values; perform robust scaling for data with outliers; preprocess numeric data using techniques learned in this session as a precursor to fitting an ML model
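
The robust-scaling exercise above addresses a concrete failure mode: a single outlier can distort mean-and-variance standardization. A sketch of the contrast, assuming scikit-learn (the library isn't named in the outline) and an illustrative data column:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# A numeric column with one extreme outlier.
x = np.array([[30.0], [35.0], [40.0], [45.0], [1000.0]])

# StandardScaler uses the mean and standard deviation, both of which the
# outlier inflates, so the four typical values get squashed together.
print(StandardScaler().fit_transform(x).ravel())

# RobustScaler centers on the median and scales by the interquartile range,
# so the bulk of the data keeps a sensible spread and the outlier stands out.
print(RobustScaler().fit_transform(x).ravel())
```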

Break (5 minutes)

Preprocessing text data (55 minutes)

  • Presentation: Generating numeric representations of text data; one-hot encoding, count vector encoding, and tf-idf; bag-of-words and bag-of-n-grams models; word embeddings to capture meaning and semantic relationships in text data
  • Jupyter Notebook exercise: Work with real-world text data, convert to a numeric representation, and fit an ML model
  • Group discussion: The pros and cons of different numeric representations of text data
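
The text representations in this session can be sketched in a few lines. This assumes scikit-learn's text feature extraction tools, a natural choice though not named in the outline; the two tiny documents are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

# Bag-of-words: each document becomes a vector of word counts.
counts = CountVectorizer().fit_transform(docs)

# Bag-of-n-grams: also count adjacent word pairs ("the cat", "cat sat", ...),
# which preserves a little word order at the cost of a larger vocabulary.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# tf-idf: down-weight terms that appear in every document ("the", "sat",
# "on") so that distinguishing words ("cat", "dog") carry more weight.
tfidf = TfidfVectorizer().fit_transform(docs)
```

These sparse count and tf-idf matrices can be fed directly to an ML model; word embeddings, the last representation in the session, instead map each word to a dense vector that captures semantic similarity.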

Wrap-up and Q&A (5 minutes)