Machine Learning Pocket Reference

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles.

With detailed notes, tables, and examples, this handy reference will help you navigate the basics of structured machine learning. Author Matt Harrison delivers a valuable guide that you can use for additional support during training and as a convenient resource when you dive into your next machine learning project.

Ideal for programmers, data scientists, and AI engineers, this book includes an overview of the machine learning process and walks you through classification with structured data. You’ll also learn methods for clustering, predicting a continuous value (regression), and reducing dimensionality, among other topics.
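
To give a flavor of that classification workflow, here is a minimal sketch of our own (not code from the book) that fits a baseline model on the Titanic data with a scikit-learn pipeline. It assumes pandas, scikit-learn, and seaborn are installed, and it uses seaborn only for its bundled copy of the Titanic dataset; the column names (pclass, age, sibsp, parch, fare, survived) are the ones in that copy.

    import seaborn as sns
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Load seaborn's copy of the Titanic dataset and keep a few numeric columns.
    df = sns.load_dataset("titanic")
    X = df[["pclass", "age", "sibsp", "parch", "fare"]]
    y = df["survived"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    # Impute missing ages, standardize, and fit a baseline model in one pipeline.
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=42)),
    ])
    pipe.fit(X_train, y_train)
    print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")

The book's walkthrough goes well beyond this baseline, covering feature engineering, model comparison, tuning, and evaluation of the fitted model.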

This pocket reference includes sections that cover:

  • Classification, using the Titanic dataset
  • Cleaning data and dealing with missing data
  • Exploratory data analysis
  • Common preprocessing steps using sample data
  • Selecting features useful to the model
  • Model selection
  • Metrics and classification evaluation
  • Regression examples using k-nearest neighbor, decision trees, boosting, and more
  • Metrics for regression evaluation
  • Clustering
  • Dimensionality reduction
  • Scikit-learn pipelines

Table of Contents

  1. Introduction
    1. Libraries Used
    2. Installation with Pip
    3. Installation with Conda
  2. Overview of the Machine Learning Process
  3. Classification Walkthrough: Titanic Dataset
    1. Project Layout Suggestion
    2. Imports
    3. Ask a Question
    4. Terms for Data
    5. Gather Data
    6. Clean Data
    7. Create Features
    8. Sample Data
    9. Impute Data
    10. Normalize Data
    11. Refactor
    12. Baseline Model
    13. Various Families
    14. Stacking
    15. Create Model
    16. Evaluate Model
    17. Optimize Model
    18. Confusion Matrix
    19. ROC Curve
    20. Learning Curve
    21. Deploy Model
  4. Missing Data
    1. Examining Missing Data
    2. Dropping Missing Data
    3. Imputing Data
    4. Adding Indicator Columns
  5. Cleaning Data
    1. Column Names
    2. Replacing Missing Values
  6. Exploring
    1. Data Size
    2. Summary Stats
    3. Histogram
    4. Scatter Plot
    5. Joint Plot
    6. Pair Grid
    7. Box & Violin Plots
    8. Correlation
    9. RadViz
    10. Parallel Coordinates
  7. Preprocess Data
    1. Standardize
    2. Scale to Range
    3. Dummy Variables
    4. Label Encoder
    5. Date Feature Engineering
    6. Add col_na Feature
    7. Manual Feature Engineering
    8. Automated Feature Engineering
  8. Feature Selection
    1. Collinear Columns
    2. Lasso Regression
    3. Recursive Feature Elimination
    4. Mutual Information
    5. Principal Component Analysis
    6. Feature Importance
  9. Imbalanced Classes
    1. Use a Different Metric
    2. Tree-Based Algorithms and Ensembles
    3. Penalize Models
    4. Upsampling Minority
    5. Generate Minority Data
    6. Downsampling Majority
    7. Upsampling then Downsampling
  10. Classification
    1. Logistic Regression
    2. Naive Bayes
    3. Support Vector Machine
    4. K Nearest Neighbor
    5. Decision Tree
    6. Random Forest
    7. XGBoost
    8. Gradient Boosted with LightGBM
    9. TPOT
  11. Model Selection
    1. Validation Curve
    2. Learning Curve
  12. Metrics & Classification Evaluation
    1. Classification Report
    2. Confusion Matrix
    3. Accuracy
    4. Recall
    5. Precision
    6. F1
    7. ROC
    8. Precision-Recall Curve
    9. Class Balance
    10. Class Prediction Error
    11. Discrimination Threshold
  13. Explaining Models
    1. Tree Interpretation
    2. Partial Dependence Plots
    3. Surrogate Models
    4. Shapley