Tidy Modeling with R

Book description

Get going with tidymodels, a collection of R packages for modeling and machine learning. Whether you're just starting out or have years of experience with modeling, this practical introduction shows data analysts, business analysts, and data scientists how the tidymodels framework offers a consistent, flexible approach for your work.

RStudio engineers Max Kuhn and Julia Silge demonstrate ways to create models by focusing on an R dialect called the tidyverse. Software that adopts tidyverse principles shares both a high-level design philosophy and low-level grammar and data structures, so learning one piece of the ecosystem makes it easier to learn the next. You'll also see why the tidymodels framework was designed for use by a broad range of people.

With this book, you will:

  • Learn the steps necessary to build a model from beginning to end
  • Understand how to use different modeling and feature engineering approaches fluently
  • Examine the options for avoiding common pitfalls of modeling, such as overfitting
  • Learn practical methods to prepare your data for modeling
  • Tune models for optimal performance
  • Use good statistical practices to compare, evaluate, and choose among models

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  2. I. Introduction
  3. 1. Software for Modeling
    1. Fundamentals for Modeling Software
    2. Types of Models
      1. Descriptive Models
      2. Inferential Models
      3. Predictive Models
    3. Connections Between Types of Models
    4. Some Terminology
    5. How Does Modeling Fit into the Data Analysis Process?
    6. Chapter Summary
  4. 2. A Tidyverse Primer
    1. Tidyverse Principles
      1. Design for Humans
      2. Reuse Existing Data Structures
      3. Design for the Pipe and Functional Programming
    2. Examples of Tidyverse Syntax
    3. Chapter Summary
  5. 3. A Review of R Modeling Fundamentals
    1. An Example
    2. What Does the R Formula Do?
    3. Why Tidiness Is Important for Modeling
    4. Combining Base R Models and the Tidyverse
    5. The tidymodels Metapackage
    6. Chapter Summary
  6. II. Modeling Basics
  7. 4. The Ames Housing Data
    1. Exploring Features of Homes in Ames
    2. Chapter Summary
  8. 5. Spending Our Data
    1. Common Methods for Splitting Data
    2. What About a Validation Set?
    3. Multilevel Data
    4. Other Considerations for a Data Budget
    5. Chapter Summary
  9. 6. Fitting Models with parsnip
    1. Create a Model
    2. Use the Model Results
    3. Make Predictions
    4. parsnip-Extension Packages
    5. Creating Model Specifications
    6. Chapter Summary
  10. 7. A Model Workflow
    1. Where Does the Model Begin and End?
    2. Workflow Basics
    3. Adding Raw Variables to the workflow()
    4. How Does a workflow() Use the Formula?
      1. Tree-Based Models
      2. Special Formulas and Inline Functions
    5. Creating Multiple Workflows at Once
    6. Evaluating the Test Set
    7. Chapter Summary
  11. 8. Feature Engineering with Recipes
    1. A Simple recipe() for the Ames Housing Data
    2. Using Recipes
    3. How Data Are Used by the recipe()
    4. Examples of Steps
      1. Encoding Qualitative Data in a Numeric Format
      2. Interaction Terms
      3. Spline Functions
      4. Feature Extraction
      5. Row Sampling Steps
      6. General Transformations
      7. Natural Language Processing
    5. Skipping Steps for New Data
    6. Tidy a recipe()
    7. Column Roles
    8. Chapter Summary
  12. 9. Judging Model Effectiveness
    1. Performance Metrics and Inference
    2. Regression Metrics
    3. Binary Classification Metrics
    4. Multiclass Classification Metrics
    5. Chapter Summary
  13. III. Tools for Creating Effective Models
  14. 10. Resampling for Evaluating Performance
    1. The Resubstitution Approach
    2. Resampling Methods
      1. Cross-Validation
      2. Repeated Cross-Validation
      3. Leave-One-Out Cross-Validation
      4. Monte Carlo Cross-Validation
      5. Validation Sets
      6. Bootstrapping
      7. Rolling Forecasting Origin Resampling
    3. Estimating Performance
    4. Parallel Processing
    5. Saving the Resampled Objects
    6. Chapter Summary
  15. 11. Comparing Models with Resampling
    1. Creating Multiple Models with Workflow Sets
    2. Comparing Resampled Performance Statistics
    3. Simple Hypothesis Testing Methods
    4. Bayesian Methods
      1. A Random Intercept Model
      2. The Effect of the Amount of Resampling
    5. Chapter Summary
  16. 12. Model Tuning and the Dangers of Overfitting
    1. Model Parameters
    2. Tuning Parameters for Different Types of Models
    3. What Do We Optimize?
    4. The Consequences of Poor Parameter Estimates
    5. Two General Strategies for Optimization
    6. Tuning Parameters in tidymodels
    7. Chapter Summary
  17. 13. Grid Search
    1. Regular and Nonregular Grids
      1. Regular Grids
      2. Nonregular Grids
    2. Evaluating the Grid
    3. Finalizing the Model
    4. Tools for Creating Tuning Specifications
    5. Tools for Efficient Grid Search
      1. Submodel Optimization
      2. Parallel Processing
      3. Benchmarking Boosted Trees
      4. Access to Global Variables
      5. Racing Methods
    6. Chapter Summary
  18. 14. Iterative Search
    1. A Support Vector Machine Model
    2. Bayesian Optimization
      1. A Gaussian Process Model
      2. Acquisition Functions
      3. The tune_bayes() Function
    3. Simulated Annealing
      1. Simulated Annealing Search Process
      2. The tune_sim_anneal() Function
    4. Chapter Summary
  19. 15. Screening Many Models
    1. Modeling Concrete Mixture Strength
    2. Creating the Workflow Set
    3. Tuning and Evaluating the Models
    4. Efficiently Screening Models
    5. Finalizing a Model
    6. Chapter Summary
  20. IV. Beyond the Basics
  21. 16. Dimensionality Reduction
    1. What Problems Can Dimensionality Reduction Solve?
    2. A Picture Is Worth a Thousand…Beans
    3. A Starter Recipe
    4. Recipes in the Wild
      1. Preparing a Recipe
      2. Baking the Recipe
    5. Feature Extraction Techniques
      1. Principal Component Analysis
      2. Partial Least Squares
      3. Independent Component Analysis
      4. Uniform Manifold Approximation and Projection
    6. Modeling
    7. Chapter Summary
  22. 17. Encoding Categorical Data
    1. Is an Encoding Necessary?
    2. Encoding Ordinal Predictors
    3. Using the Outcome for Encoding Predictors
      1. Effect Encodings in tidymodels
      2. Effect Encodings with Partial Pooling
    4. Feature Hashing
    5. More Encoding Options
    6. Chapter Summary
  23. 18. Explaining Models and Predictions
    1. Software for Model Explanations
    2. Local Explanations
    3. Global Explanations
    4. Building Global Explanations from Local Explanations
    5. Back to Beans!
    6. Chapter Summary
  24. 19. When Should You Trust Your Predictions?
    1. Equivocal Results
    2. Determining Model Applicability
    3. Chapter Summary
  25. 20. Ensembles of Models
    1. Creating the Training Set for Stacking
    2. Blend the Predictions
    3. Fit the Member Models
    4. Test Set Results
    5. Chapter Summary
  26. 21. Inferential Analysis
    1. Inference for Count Data
    2. Comparisons with Two-Sample Tests
    3. Log-Linear Models
    4. A More Complex Model
    5. More Inferential Analysis
    6. Chapter Summary
  27. A. Recommended Preprocessing
  28. References
  29. Index
  30. About the Authors

Product information

  • Title: Tidy Modeling with R
  • Author(s): Max Kuhn, Julia Silge
  • Release date: July 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492096481