book

Machine Learning with Python Cookbook

Name: Machine Learning with Python Cookbook
Author: Chris Albon
ISBN: 9781491989388

by Chris Albon

March 2018

Intermediate to advanced

364 pages

7h 12m

English

O'Reilly Media, Inc.

Read now

Unlock full access

Preface
Who This Book Is ForWho This Book Is Not ForTerminology Used in This BookAcknowledgments
1. Vectors, Matrices, and Arrays
1.0. Introduction1.1. Creating a Vector1.2. Creating a Matrix1.3. Creating a Sparse Matrix1.4. Selecting Elements1.5. Describing a Matrix1.6. Applying Operations to Elements1.7. Finding the Maximum and Minimum Values1.8. Calculating the Average, Variance, and Standard Deviation1.9. Reshaping Arrays1.10. Transposing a Vector or Matrix1.11. Flattening a Matrix1.12. Finding the Rank of a Matrix1.13. Calculating the Determinant1.14. Getting the Diagonal of a Matrix1.15. Calculating the Trace of a Matrix1.16. Finding Eigenvalues and Eigenvectors1.17. Calculating Dot Products1.18. Adding and Subtracting Matrices1.19. Multiplying Matrices1.20. Inverting a Matrix1.21. Generating Random Values
2. Loading Data
2.0. Introduction2.1. Loading a Sample Dataset2.2. Creating a Simulated Dataset2.3. Loading a CSV File2.4. Loading an Excel File2.5. Loading a JSON File2.6. Querying a SQL Database
3. Data Wrangling
3.0. Introduction3.1. Creating a Data Frame3.2. Describing the Data3.3. Navigating DataFrames3.4. Selecting Rows Based on Conditionals3.5. Replacing Values3.6. Renaming Columns3.7. Finding the Minimum, Maximum, Sum, Average, and Count3.8. Finding Unique Values3.9. Handling Missing Values3.10. Deleting a Column3.11. Deleting a Row3.12. Dropping Duplicate Rows3.13. Grouping Rows by Values3.14. Grouping Rows by Time3.15. Looping Over a Column3.16. Applying a Function Over All Elements in a Column3.17. Applying a Function to Groups3.18. Concatenating DataFrames3.19. Merging DataFrames
4. Handling Numerical Data
4.0. Introduction4.1. Rescaling a Feature4.2. Standardizing a Feature4.3. Normalizing Observations4.4. Generating Polynomial and Interaction Features4.5. Transforming Features4.6. Detecting Outliers4.7. Handling Outliers4.8. Discretizating Features4.9. Grouping Observations Using Clustering4.10. Deleting Observations with Missing Values4.11. Imputing Missing Values
5. Handling Categorical Data
5.0. Introduction5.1. Encoding Nominal Categorical Features5.2. Encoding Ordinal Categorical Features5.3. Encoding Dictionaries of Features5.4. Imputing Missing Class Values5.5. Handling Imbalanced Classes
6. Handling Text
6.0. Introduction6.1. Cleaning Text6.2. Parsing and Cleaning HTML6.3. Removing Punctuation6.4. Tokenizing Text6.5. Removing Stop Words6.6. Stemming Words6.7. Tagging Parts of Speech6.8. Encoding Text as a Bag of Words6.9. Weighting Word Importance
7. Handling Dates and Times
7.0. Introduction7.1. Converting Strings to Dates7.2. Handling Time Zones7.3. Selecting Dates and Times7.4. Breaking Up Date Data into Multiple Features7.5. Calculating the Difference Between Dates7.6. Encoding Days of the Week7.7. Creating a Lagged Feature7.8. Using Rolling Time Windows7.9. Handling Missing Data in Time Series
8. Handling Images
8.0. Introduction8.1. Loading Images8.2. Saving Images8.3. Resizing Images8.4. Cropping Images8.5. Blurring Images8.6. Sharpening Images8.7. Enhancing Contrast8.8. Isolating Colors8.9. Binarizing Images8.10. Removing Backgrounds8.11. Detecting Edges8.12. Detecting Corners8.13. Creating Features for Machine Learning8.14. Encoding Mean Color as a Feature8.15. Encoding Color Histograms as Features
9. Dimensionality Reduction Using Feature Extraction
9.0. Introduction9.1. Reducing Features Using Principal Components9.2. Reducing Features When Data Is Linearly Inseparable9.3. Reducing Features by Maximizing Class Separability9.4. Reducing Features Using Matrix Factorization9.5. Reducing Features on Sparse Data

10. Dimensionality Reduction Using Feature Selection
10.0. Introduction10.1. Thresholding Numerical Feature Variance10.2. Thresholding Binary Feature Variance10.3. Handling Highly Correlated Features10.4. Removing Irrelevant Features for Classification10.5. Recursively Eliminating Features
11. Model Evaluation
11.0. Introduction11.1. Cross-Validating Models11.2. Creating a Baseline Regression Model11.3. Creating a Baseline Classification Model11.4. Evaluating Binary Classifier Predictions11.5. Evaluating Binary Classifier Thresholds11.6. Evaluating Multiclass Classifier Predictions11.7. Visualizing a Classifier’s Performance11.8. Evaluating Regression Models11.9. Evaluating Clustering Models11.10. Creating a Custom Evaluation Metric11.11. Visualizing the Effect of Training Set Size11.12. Creating a Text Report of Evaluation Metrics11.13. Visualizing the Effect of Hyperparameter Values
12. Model Selection
12.0. Introduction12.1. Selecting Best Models Using Exhaustive Search12.2. Selecting Best Models Using Randomized Search12.3. Selecting Best Models from Multiple Learning Algorithms12.4. Selecting Best Models When Preprocessing12.5. Speeding Up Model Selection with Parallelization12.6. Speeding Up Model Selection Using Algorithm-Specific Methods12.7. Evaluating Performance After Model Selection
13. Linear Regression
13.0. Introduction13.1. Fitting a Line13.2. Handling Interactive Effects13.3. Fitting a Nonlinear Relationship13.4. Reducing Variance with Regularization13.5. Reducing Features with Lasso Regression
14. Trees and Forests
14.0. Introduction14.1. Training a Decision Tree Classifier14.2. Training a Decision Tree Regressor14.3. Visualizing a Decision Tree Model14.4. Training a Random Forest Classifier14.5. Training a Random Forest Regressor14.6. Identifying Important Features in Random Forests14.7. Selecting Important Features in Random Forests14.8. Handling Imbalanced Classes14.9. Controlling Tree Size14.10. Improving Performance Through Boosting14.11. Evaluating Random Forests with Out-of-Bag Errors
15. K-Nearest Neighbors
15.0. Introduction15.1. Finding an Observation’s Nearest Neighbors15.2. Creating a K-Nearest Neighbor Classifier15.3. Identifying the Best Neighborhood Size15.4. Creating a Radius-Based Nearest Neighbor Classifier
16. Logistic Regression
16.0. Introduction16.1. Training a Binary Classifier16.2. Training a Multiclass Classifier16.3. Reducing Variance Through Regularization16.4. Training a Classifier on Very Large Data16.5. Handling Imbalanced Classes
17. Support Vector Machines
17.0. Introduction17.1. Training a Linear Classifier17.2. Handling Linearly Inseparable Classes Using Kernels17.3. Creating Predicted Probabilities17.4. Identifying Support Vectors17.5. Handling Imbalanced Classes
18. Naive Bayes
18.0. Introduction18.1. Training a Classifier for Continuous Features18.2. Training a Classifier for Discrete and Count Features18.3. Training a Naive Bayes Classifier for Binary Features18.4. Calibrating Predicted Probabilities
19. Clustering
19.0. Introduction19.1. Clustering Using K-Means19.2. Speeding Up K-Means Clustering19.3. Clustering Using Meanshift19.4. Clustering Using DBSCAN19.5. Clustering Using Hierarchical Merging
20. Neural Networks
20.0. Introduction20.1. Preprocessing Data for Neural Networks20.2. Designing a Neural Network20.3. Training a Binary Classifier20.4. Training a Multiclass Classifier20.5. Training a Regressor20.6. Making Predictions20.7. Visualize Training History20.8. Reducing Overfitting with Weight Regularization20.9. Reducing Overfitting with Early Stopping20.10. Reducing Overfitting with Dropout20.11. Saving Model Training Progress20.12. k-Fold Cross-Validating Neural Networks20.13. Tuning Neural Networks20.14. Visualizing Neural Networks20.15. Classifying Images20.16. Improving Performance with Image Augmentation20.17. Classifying Text
21. Saving and Loading Trained Models
21.0. Introduction21.1. Saving and Loading a scikit-learn Model21.2. Saving and Loading a Keras Model
Index

Content preview from Machine Learning with Python Cookbook

Chapter 4. Handling Numerical Data

4.0 Introduction

Quantitative data is the measurement of something—whether class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 29 students, $529,392 in sales). In this chapter, we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.

4.1 Rescaling a Feature

Problem

You need to rescale the values of a numerical feature to be between two values.

Solution

Use scikit-learn’s MinMaxScaler to rescale a feature array:

# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])

# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)

# Show feature
scaled_feature

array([[ 0.        ],
       [ 0.28571429],
       [ 0.35714286],
       [ 0.42857143],
       [ 1.        ]])

Discussion

Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or –1 to 1. There are a number of rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specifically, min-max calculates:

x_{i}^{'} = \frac{x_{i} - min (x)}{max (x) - min (x)}

where x is the feature vector, x

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

O’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.

Julian F.

Head of Cybersecurity

I wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.

Addison B.

Field Engineer

I’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.

Amir M.

Data Platform Tech Lead

I'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.

Mark W.

Embedded Software Engineer

Python Machine Learning Cookbook - Second Edition

Publisher Resources

ISBN: 9781491989371Errata Page

Cloud Computing

Data Engineering

Data Science

AI & ML

Programming Languages

Software Architecture

IT/Ops

Security

Design

Business

Soft Skills