Machine Learning with Python Cookbook, 2nd Edition

by Kyle Gallatin, Chris Albon

August 2023

Intermediate to advanced

413 pages

8h 21m

English

O'Reilly Media, Inc.

Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Working with Vectors, Matrices, and Arrays in NumPy
1.0. Introduction
1.1. Creating a Vector
1.2. Creating a Matrix
1.3. Creating a Sparse Matrix
1.4. Preallocating NumPy Arrays
1.5. Selecting Elements
1.6. Describing a Matrix
1.7. Applying Functions over Each Element
1.8. Finding the Maximum and Minimum Values
1.9. Calculating the Average, Variance, and Standard Deviation
1.10. Reshaping Arrays
1.11. Transposing a Vector or Matrix
1.12. Flattening a Matrix
1.13. Finding the Rank of a Matrix
1.14. Getting the Diagonal of a Matrix
1.15. Calculating the Trace of a Matrix
1.16. Calculating Dot Products
1.17. Adding and Subtracting Matrices
1.18. Multiplying Matrices
1.19. Inverting a Matrix
1.20. Generating Random Values
2. Loading Data
2.0. Introduction
2.1. Loading a Sample Dataset
2.2. Creating a Simulated Dataset
2.3. Loading a CSV File
2.4. Loading an Excel File
2.5. Loading a JSON File
2.6. Loading a Parquet File
2.7. Loading an Avro File
2.8. Querying a SQLite Database
2.9. Querying a Remote SQL Database
2.10. Loading Data from a Google Sheet
2.11. Loading Data from an S3 Bucket
2.12. Loading Unstructured Data
3. Data Wrangling
3.0. Introduction
3.1. Creating a Dataframe
3.2. Getting Information about the Data
3.3. Slicing DataFrames
3.4. Selecting Rows Based on Conditionals
3.5. Sorting Values
3.6. Replacing Values
3.7. Renaming Columns
3.8. Finding the Minimum, Maximum, Sum, Average, and Count
3.9. Finding Unique Values
3.10. Handling Missing Values
3.11. Deleting a Column
3.12. Deleting a Row
3.13. Dropping Duplicate Rows
3.14. Grouping Rows by Values
3.15. Grouping Rows by Time
3.16. Aggregating Operations and Statistics
3.17. Looping over a Column
3.18. Applying a Function over All Elements in a Column
3.19. Applying a Function to Groups
3.20. Concatenating DataFrames
3.21. Merging DataFrames
4. Handling Numerical Data
4.0. Introduction
4.1. Rescaling a Feature
4.2. Standardizing a Feature
4.3. Normalizing Observations
4.4. Generating Polynomial and Interaction Features
4.5. Transforming Features
4.6. Detecting Outliers
4.7. Handling Outliers
4.8. Discretizating Features
4.9. Grouping Observations Using Clustering
4.10. Deleting Observations with Missing Values
4.11. Imputing Missing Values
5. Handling Categorical Data
5.0. Introduction
5.1. Encoding Nominal Categorical Features
5.2. Encoding Ordinal Categorical Features
5.3. Encoding Dictionaries of Features
5.4. Imputing Missing Class Values
5.5. Handling Imbalanced Classes
6. Handling Text
6.0. Introduction
6.1. Cleaning Text
6.2. Parsing and Cleaning HTML
6.3. Removing Punctuation
6.4. Tokenizing Text
6.5. Removing Stop Words
6.6. Stemming Words
6.7. Tagging Parts of Speech
6.8. Performing Named-Entity Recognition
6.9. Encoding Text as a Bag of Words
6.10. Weighting Word Importance
6.11. Using Text Vectors to Calculate Text Similarity in a Search Query
6.12. Using a Sentiment Analysis Classifier
7. Handling Dates and Times
7.0. Introduction
7.1. Converting Strings to Dates
7.2. Handling Time Zones
7.3. Selecting Dates and Times
7.4. Breaking Up Date Data into Multiple Features
7.5. Calculating the Difference Between Dates
7.6. Encoding Days of the Week
7.7. Creating a Lagged Feature
7.8. Using Rolling Time Windows
7.9. Handling Missing Data in Time Series
8. Handling Images
8.0. Introduction
8.1. Loading Images
8.2. Saving Images
8.3. Resizing Images
8.4. Cropping Images
8.5. Blurring Images
8.6. Sharpening Images
8.7. Enhancing Contrast
8.8. Isolating Colors
8.9. Binarizing Images
8.10. Removing Backgrounds
8.11. Detecting Edges
8.12. Detecting Corners
8.13. Creating Features for Machine Learning
8.14. Encoding Color Histograms as Features
8.15. Using Pretrained Embeddings as Features
8.16. Detecting Objects with OpenCV
8.17. Classifying Images with Pytorch
9. Dimensionality Reduction Using Feature Extraction
9.0. Introduction
9.1. Reducing Features Using Principal Components
9.2. Reducing Features When Data Is Linearly Inseparable
9.3. Reducing Features by Maximizing Class Separability
9.4. Reducing Features Using Matrix Factorization
9.5. Reducing Features on Sparse Data

10. Dimensionality Reduction Using Feature Selection
10.0. Introduction
10.1. Thresholding Numerical Feature Variance
10.2. Thresholding Binary Feature Variance
10.3. Handling Highly Correlated Features
10.4. Removing Irrelevant Features for Classification
10.5. Recursively Eliminating Features
11. Model Evaluation
11.0. Introduction
11.1. Cross-Validating Models
11.2. Creating a Baseline Regression Model
11.3. Creating a Baseline Classification Model
11.4. Evaluating Binary Classifier Predictions
11.5. Evaluating Binary Classifier Thresholds
11.6. Evaluating Multiclass Classifier Predictions
11.7. Visualizing a Classifier’s Performance
11.8. Evaluating Regression Models
11.9. Evaluating Clustering Models
11.10. Creating a Custom Evaluation Metric
11.11. Visualizing the Effect of Training Set Size
11.12. Creating a Text Report of Evaluation Metrics
11.13. Visualizing the Effect of Hyperparameter Values
12. Model Selection
12.0. Introduction
12.1. Selecting the Best Models Using Exhaustive Search
12.2. Selecting the Best Models Using Randomized Search
12.3. Selecting the Best Models from Multiple Learning Algorithms
12.4. Selecting the Best Models When Preprocessing
12.5. Speeding Up Model Selection with Parallelization
12.6. Speeding Up Model Selection Using Algorithm-Specific Methods
12.7. Evaluating Performance After Model Selection
13. Linear Regression
13.0. Introduction
13.1. Fitting a Line
13.2. Handling Interactive Effects
13.3. Fitting a Nonlinear Relationship
13.4. Reducing Variance with Regularization
13.5. Reducing Features with Lasso Regression
14. Trees and Forests
14.0. Introduction
14.1. Training a Decision Tree Classifier
14.2. Training a Decision Tree Regressor
14.3. Visualizing a Decision Tree Model
14.4. Training a Random Forest Classifier
14.5. Training a Random Forest Regressor
14.6. Evaluating Random Forests with Out-of-Bag Errors
14.7. Identifying Important Features in Random Forests
14.8. Selecting Important Features in Random Forests
14.9. Handling Imbalanced Classes
14.10. Controlling Tree Size
14.11. Improving Performance Through Boosting
14.12. Training an XGBoost Model
14.13. Improving Real-Time Performance with LightGBM
15. K-Nearest Neighbors
15.0. Introduction
15.1. Finding an Observation’s Nearest Neighbors
15.2. Creating a K-Nearest Neighbors Classifier
15.3. Identifying the Best Neighborhood Size
15.4. Creating a Radius-Based Nearest Neighbors Classifier
15.5. Finding Approximate Nearest Neighbors
15.6. Evaluating Approximate Nearest Neighbors
16. Logistic Regression
16.0. Introduction
16.1. Training a Binary Classifier
16.2. Training a Multiclass Classifier
16.3. Reducing Variance Through Regularization
16.4. Training a Classifier on Very Large Data
16.5. Handling Imbalanced Classes
17. Support Vector Machines
17.0. Introduction
17.1. Training a Linear Classifier
17.2. Handling Linearly Inseparable Classes Using Kernels
17.3. Creating Predicted Probabilities
17.4. Identifying Support Vectors
17.5. Handling Imbalanced Classes
18. Naive Bayes
18.0. Introduction
18.1. Training a Classifier for Continuous Features
18.2. Training a Classifier for Discrete and Count Features
18.3. Training a Naive Bayes Classifier for Binary Features
18.4. Calibrating Predicted Probabilities
19. Clustering
19.0. Introduction
19.1. Clustering Using K-Means
19.2. Speeding Up K-Means Clustering
19.3. Clustering Using Mean Shift
19.4. Clustering Using DBSCAN
19.5. Clustering Using Hierarchical Merging
20. Tensors with PyTorch
20.0. Introduction
20.1. Creating a Tensor
20.2. Creating a Tensor from NumPy
20.3. Creating a Sparse Tensor
20.4. Selecting Elements in a Tensor
20.5. Describing a Tensor
20.6. Applying Operations to Elements
20.7. Finding the Maximum and Minimum Values
20.8. Reshaping Tensors
20.9. Transposing a Tensor
20.10. Flattening a Tensor
20.11. Calculating Dot Products
20.12. Multiplying Tensors
21. Neural Networks
21.0. Introduction
21.1. Using Autograd with PyTorch
21.2. Preprocessing Data for Neural Networks
21.3. Designing a Neural Network
21.4. Training a Binary Classifier
21.5. Training a Multiclass Classifier
21.6. Training a Regressor
21.7. Making Predictions
21.8. Visualize Training History
21.9. Reducing Overfitting with Weight Regularization
21.10. Reducing Overfitting with Early Stopping
21.11. Reducing Overfitting with Dropout
21.12. Saving Model Training Progress
21.13. Tuning Neural Networks
21.14. Visualizing Neural Networks
22. Neural Networks for Unstructured Data
22.0. Introduction
22.1. Training a Neural Network for Image Classification
22.2. Training a Neural Network for Text Classification
22.3. Fine-Tuning a Pretrained Model for Image Classification
22.4. Fine-Tuning a Pretrained Model for Text Classification
23. Saving, Loading, and Serving Trained Models
23.0. Introduction
23.1. Saving and Loading a scikit-learn Model
23.2. Saving and Loading a TensorFlow Model
23.3. Saving and Loading a PyTorch Model
23.4. Serving scikit-learn Models
23.5. Serving TensorFlow Models
23.6. Serving PyTorch Models in Seldon
Index
About the Authors

Content preview from Machine Learning with Python Cookbook, 2nd Edition

Chapter 5. Handling Categorical Data

5.0 Introduction

It is often useful to measure objects not in terms of their quantity but in terms of some quality. We frequently represent qualitative information in categories such as gender, colors, or brand of car. However, not all categorical data is the same. Sets of categories with no intrinsic ordering are called nominal. Examples of nominal categories include:

Blue, Red, Green
Man, Woman
Banana, Strawberry, Apple

In contrast, when a set of categories has some natural ordering we refer to it as ordinal. For example:

Low, Medium, High
Young, Old
Agree, Neutral, Disagree
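
The difference between the two kinds of categories can be sketched in a few lines of plain Python. This is only an illustration, not a method taken from the text: the names `level_order`, `level_rank`, and `observations` are hypothetical, and the ranking of the levels is an assumption made for the example.

```python
# Nominal categories: only equality is meaningful, so a set is enough.
colors = {"Blue", "Red", "Green"}

# Ordinal categories: record the order explicitly (lowest to highest).
# This ranking is an illustrative assumption, not data from the text.
level_order = ["Low", "Medium", "High"]
level_rank = {level: rank for rank, level in enumerate(level_order)}

observations = ["Medium", "Low", "High", "Low"]

# Sorting by rank respects the natural ordering of the levels.
sorted_by_rank = sorted(observations, key=level_rank.get)

# Plain string sorting is alphabetical and ignores that ordering.
sorted_alphabetically = sorted(observations)

print(sorted_by_rank)         # ['Low', 'Low', 'Medium', 'High']
print(sorted_alphabetically)  # ['High', 'Low', 'Low', 'Medium']
```

The ordering has to be supplied by whoever knows the data; nothing in the strings themselves says that "Medium" sits between "Low" and "High".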

Furthermore, categorical information is often represented in data as a vector or column of strings (e.g., "Maine", "Texas", "Delaware"). The problem is that most machine learning algorithms require inputs to be numerical values.

The k-nearest neighbors algorithm is an example of an algorithm that requires numerical data. One step in the algorithm is calculating the distances between observations—often using Euclidean distance:

$$\sqrt{\sum_{i=1}^{n} (x_{i} - y_{i})^{2}}$$

where $x$ and $y$ are two observations and the subscript $i$ denotes the value of each observation’s $i$th feature. However, the distance calculation is impossible if the value of $x_{i}$ is a string (e.g., "Texas"). Instead, we need to convert the string into some numerical format so that it can be input into the Euclidean distance equation. Our goal is to transform the data in a way that properly captures ...
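
To make the problem concrete, here is a minimal sketch of the Euclidean distance formula in plain Python. The helper names `euclidean_distance` and `one_hot` are hypothetical, and the indicator (one-per-category) encoding is just one possible numerical format, chosen here as an assumption rather than as what the truncated text goes on to describe.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x_i - y_i) ** 2 for x_i, y_i in zip(x, y)))


# Numeric features work as expected: sqrt(3**2 + 4**2) = 5.0.
numeric_distance = euclidean_distance([1.0, 2.0], [4.0, 6.0])

# Strings cannot be subtracted, so the same calculation raises TypeError.
try:
    euclidean_distance(["Texas"], ["Maine"])
    string_distance_failed = False
except TypeError:
    string_distance_failed = True

# One possible numerical format (an assumption for this sketch):
# a 0/1 indicator for each category.
states = ["Delaware", "Maine", "Texas"]


def one_hot(value, categories):
    """Return a list with 1 at the position of value and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


texas = one_hot("Texas", states)  # [0, 0, 1]
maine = one_hot("Maine", states)  # [0, 1, 0]

# Once encoded, the distance is defined: sqrt(0 + 1 + 1) = sqrt(2).
encoded_distance = euclidean_distance(texas, maine)
```

With this encoding, any two distinct states are the same distance apart, which fits categories that have no intrinsic ordering.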


Publisher Resources

ISBN: 9781098135713
