
Machine Learning Pocket Reference

Book Description

With detailed notes, tables, and examples, this handy reference will help you navigate the basics of structured machine learning. Author Matt Harrison delivers a valuable guide that you can use for additional support during training and as a convenient resource when you dive into your next machine learning project.

Ideal for programmers, data scientists, and AI engineers, this book includes an overview of the machine learning process and walks you through classification with structured data. You’ll also learn methods for clustering, predicting a continuous value (regression), and reducing dimensionality, among other topics.

This pocket reference includes sections that cover:

  • Classification, using the Titanic dataset
  • Cleaning data and dealing with missing data
  • Exploratory data analysis
  • Common preprocessing steps using sample data
  • Selecting features useful to the model
  • Model selection
  • Metrics and classification evaluation
  • Regression examples using k-nearest neighbor, decision trees, boosting, and more
  • Metrics for regression evaluation
  • Clustering
  • Dimensionality reduction
  • Scikit-learn pipelines
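To give a flavor of how several of the topics above fit together (imputing missing data, standardizing, fitting a baseline classifier, and wrapping it all in a scikit-learn pipeline), here is a minimal illustrative sketch. The tiny synthetic dataset is a stand-in for the Titanic data the book actually uses; the specific column layout and model choices here are assumptions for demonstration, not the book's code.

```python
# Illustrative sketch only: a minimal scikit-learn pipeline in the spirit
# of the topics listed above. The synthetic data below stands in for the
# Titanic dataset used in the book.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # simple linearly separable target
X[rng.integers(0, 200, 20), 0] = np.nan    # inject some missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # deal with missing data
    ("scale", StandardScaler()),                   # standardize features
    ("clf", LogisticRegression()),                 # baseline classifier
])
pipe.fit(X_train, y_train)
acc = accuracy_score(y_test, pipe.predict(X_test))
print(f"accuracy: {acc:.2f}")
```

Bundling preprocessing and the model into one `Pipeline` keeps the imputation and scaling statistics fit on the training split only, avoiding leakage into the test evaluation.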

Table of Contents

  1. Introduction
    1. Libraries Used
    2. Installation with Pip
    3. Installation with Conda
  2. Overview of Machine Learning Process
  3. Classification Walkthrough: Titanic Dataset
    1. Project Layout Suggestion
    2. Imports
    3. Ask a Question
    4. Terms for Data
    5. Gather Data
    6. Clean Data
    7. Create Features
    8. Sample Data
    9. Impute Data
    10. Normalize Data
    11. Refactor
    12. Baseline Model
    13. Various Families
    14. Stacking
    15. Create Model
    16. Evaluate Model
    17. Optimize Model
    18. Confusion Matrix
    19. ROC Curve
    20. Learning Curve
    21. Deploy Model
  4. Missing Data
    1. Examining Missing Data
    2. Dropping Missing Data
    3. Imputing Data
    4. Adding Indicator Columns
  5. Cleaning Data
    1. Column Names
    2. Replacing Missing Values
  6. Exploring
    1. Data Size
    2. Summary Stats
    3. Histogram
    4. Scatter Plot
    5. Joint Plot
    6. Pair Grid
    7. Box & Violin Plots
    8. Correlation
    9. RadViz
    10. Parallel Coordinates
  7. Preprocess Data
    1. Standardize
    2. Scale to range
    3. Dummy Variables
    4. Label Encoder
    5. Frequency Encoding
    6. Pulling Categories from Strings
    7. Other Categorical Encoding
    8. Date Feature Engineering
    9. Add col_na Feature
    10. Manual Feature Engineering
  8. Feature Selection
    1. Collinear Columns
    2. Lasso Regression
    3. Recursive Feature Elimination
    4. Mutual Information
    5. Principal Component Analysis
    6. Feature Importance
  9. Imbalanced Classes
    1. Use a Different Metric
    2. Tree Based Algorithms and Ensembles
    3. Penalize Models
    4. Upsampling Minority
    5. Generate Minority Data
    6. Downsampling Majority
    7. Upsampling then Downsampling
  10. Classification
    1. Logistic Regression
    2. Naive Bayes
    3. Support Vector Machine
    4. K Nearest Neighbor
    5. Decision Tree
    6. Random Forest
    7. XGBoost
    8. Gradient Boosted with LightGBM
    9. TPOT
  11. Model Selection
    1. Validation Curve
    2. Learning Curve
  12. Metrics & Classification Evaluation
    1. Confusion Matrix
    2. Metrics
    3. Accuracy
    4. Recall
    5. Precision
    6. F1
    7. Classification Report
    8. ROC
    9. Precision-Recall Curve
    10. Cumulative Gains Plot
    11. Lift Curve
    12. Class Balance
    13. Class Prediction Error
    14. Discrimination Threshold
  13. Explaining Models
    1. Regression Coefficients
    2. Feature Importance
    3. LIME
    4. Tree Interpretation
    5. Partial Dependence Plots
    6. Surrogate Models
    7. Shapley
  14. Regression
    1. Baseline model
    2. Linear Regression
    3. SVM
    4. K Nearest Neighbor
    5. Decision Tree
    6. Random Forest
    7. XGBoost Regression
    8. LightGBM Regression
  15. Metrics & Regression Evaluation
    1. Metrics
    2. Residuals Plot
    3. Heteroscedasticity
    4. Normal Residuals
    5. Prediction Error Plot
  16. Explaining Regression Models
    1. Shapley
  17. Dimensionality Reduction
    1. PCA
    2. UMAP
    3. t-SNE
    4. PHATE
  18. Clustering
    1. K-Means
    2. Agglomerative (Hierarchical) Clustering
  19. Pipelines
    1. Classification Pipeline
    2. Regression Pipeline
    3. PCA Pipeline