Tidy Modeling with R

by Max Kuhn, Julia Silge

July 2022

Beginner to intermediate

381 pages

9h 22m

English

O'Reilly Media, Inc.

Preface
  Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Introduction
1. Software for Modeling
  Fundamentals for Modeling Software; Types of Models; Descriptive Models; Inferential Models; Predictive Models; Connections Between Types of Models; Some Terminology; How Does Modeling Fit into the Data Analysis Process?; Chapter Summary
2. A Tidyverse Primer
  Tidyverse Principles; Design for Humans; Reuse Existing Data Structures; Design for the Pipe and Functional Programming; Examples of Tidyverse Syntax; Chapter Summary
3. A Review of R Modeling Fundamentals
  An Example; What Does the R Formula Do?; Why Tidiness Is Important for Modeling; Combining Base R Models and the Tidyverse; The tidymodels Metapackage; Chapter Summary
II. Modeling Basics
4. The Ames Housing Data
  Exploring Features of Homes in Ames; Chapter Summary
5. Spending Our Data
  Common Methods for Splitting Data; What About a Validation Set?; Multilevel Data; Other Considerations for a Data Budget; Chapter Summary
6. Fitting Models with parsnip
  Create a Model; Use the Model Results; Make Predictions; parsnip-Extension Packages; Creating Model Specifications; Chapter Summary
7. A Model Workflow
  Where Does the Model Begin and End?; Workflow Basics; Adding Raw Variables to the workflow(); How Does a workflow() Use the Formula?; Tree-Based Models; Special Formulas and Inline Functions; Creating Multiple Workflows at Once; Evaluating the Test Set; Chapter Summary
8. Feature Engineering with Recipes
  A Simple recipe() for the Ames Housing Data; Using Recipes; How Data Are Used by the recipe(); Examples of Steps; Encoding Qualitative Data in a Numeric Format; Interaction Terms; Spline Functions; Feature Extraction; Row Sampling Steps; General Transformations; Natural Language Processing; Skipping Steps for New Data; Tidy a recipe(); Column Roles; Chapter Summary
9. Judging Model Effectiveness
  Performance Metrics and Inference; Regression Metrics; Binary Classification Metrics; Multiclass Classification Metrics; Chapter Summary
III. Tools for Creating Effective Models
10. Resampling for Evaluating Performance
  The Resubstitution Approach; Resampling Methods; Cross-Validation; Repeated Cross-Validation; Leave-One-Out Cross-Validation; Monte Carlo Cross-Validation; Validation Sets; Bootstrapping; Rolling Forecasting Origin Resampling; Estimating Performance; Parallel Processing; Saving the Resampled Objects; Chapter Summary
11. Comparing Models with Resampling
  Creating Multiple Models with Workflow Sets; Comparing Resampled Performance Statistics; Simple Hypothesis Testing Methods; Bayesian Methods; A Random Intercept Model; The Effect of the Amount of Resampling; Chapter Summary
12. Model Tuning and the Dangers of Overfitting
  Model Parameters; Tuning Parameters for Different Types of Models; What Do We Optimize?; The Consequences of Poor Parameter Estimates; Two General Strategies for Optimization; Tuning Parameters in tidymodels; Chapter Summary
13. Grid Search
  Regular and Nonregular Grids; Regular Grids; Nonregular Grids; Evaluating the Grid; Finalizing the Model; Tools for Creating Tuning Specifications; Tools for Efficient Grid Search; Submodel Optimization; Parallel Processing; Benchmarking Boosted Trees; Access to Global Variables; Racing Methods; Chapter Summary
14. Iterative Search
  A Support Vector Machine Model; Bayesian Optimization; A Gaussian Process Model; Acquisition Functions; The tune_bayes() Function; Simulated Annealing; Simulated Annealing Search Process; The tune_sim_anneal() Function; Chapter Summary
15. Screening Many Models
  Modeling Concrete Mixture Strength; Creating the Workflow Set; Tuning and Evaluating the Models; Efficiently Screening Models; Finalizing a Model; Chapter Summary
IV. Beyond the Basics
16. Dimensionality Reduction
  What Problems Can Dimensionality Reduction Solve?; A Picture Is Worth a Thousand…Beans; A Starter Recipe; Recipes in the Wild; Preparing a Recipe; Baking the Recipe; Feature Extraction Techniques; Principal Component Analysis; Partial Least Squares; Independent Component Analysis; Uniform Manifold Approximation and Projection; Modeling; Chapter Summary
17. Encoding Categorical Data
  Is an Encoding Necessary?; Encoding Ordinal Predictors; Using the Outcome for Encoding Predictors; Effect Encodings in tidymodels; Effect Encodings with Partial Pooling; Feature Hashing; More Encoding Options; Chapter Summary
18. Explaining Models and Predictions
  Software for Model Explanations; Local Explanations; Global Explanations; Building Global Explanations from Local Explanations; Back to Beans!; Chapter Summary
19. When Should You Trust Your Predictions?
  Equivocal Results; Determining Model Applicability; Chapter Summary
20. Ensembles of Models
  Creating the Training Set for Stacking; Blend the Predictions; Fit the Member Models; Test Set Results; Chapter Summary
21. Inferential Analysis
  Inference for Count Data; Comparisons with Two-Sample Tests; Log-Linear Models; A More Complex Model; More Inferential Analysis; Chapter Summary
A. Recommended Preprocessing
References
Index
About the Authors

Content preview from Tidy Modeling with R

Chapter 11. Comparing Models with Resampling

Once we create two or more models, the next step is to compare them to understand which one is best. In some cases, comparisons might be within-model, where the same model might be evaluated with different features or preprocessing methods. Alternatively, between-model comparisons, such as when we compared linear regression and random forest models in Chapter 10, are the more common scenario.

In either case, the result is a collection of resampled summary statistics (e.g., RMSE or accuracy) for each model. In this chapter, we’ll first demonstrate how workflow sets can be used to fit multiple models. Then, we’ll discuss important aspects of resampling statistics. Finally, we’ll look at how to formally compare models (using either hypothesis testing or a Bayesian approach).

Creating Multiple Models with Workflow Sets

In Chapter 7, we described the idea of a workflow set, where different preprocessors and/or models can be combinatorially generated. In Chapter 10, we used a recipe for the Ames data that included an interaction term as well as spline functions for longitude and latitude. To demonstrate more with workflow sets, let’s create three different linear models that add these preprocessing steps incrementally; we can test whether these additional terms improve the model results. We’ll create three recipes, then combine them into a workflow set:

library(tidymodels)
tidymodels_prefer()

basic_rec <-
  recipe(Sale_Price ~ Neighborhood ...
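
The general pattern of building recipes incrementally and combining them into a workflow set can be sketched on a small built-in dataset. The following is only an illustration under stated assumptions, not the Ames code from the chapter: `mtcars` is a stand-in for the Ames training data, and the recipe names, predictors, and the particular steps (`base_rec`, `interact_rec`, `spline_rec`, `deg_free = 3`) are hypothetical choices made for the example.

```r
library(tidymodels)
tidymodels_prefer()

# Stand-in data: mtcars, NOT the Ames housing data used in the chapter.
# All recipe names and chosen predictors below are hypothetical.
base_rec <-
  recipe(mpg ~ wt + hp + disp, data = mtcars) %>%
  step_log(hp, base = 10)

# Second recipe: the base recipe plus an interaction term.
interact_rec <-
  base_rec %>%
  step_interact(~ wt:hp)

# Third recipe: the interaction recipe plus natural spline terms.
spline_rec <-
  interact_rec %>%
  step_ns(disp, deg_free = 3)

# A named list of preprocessors; the names become part of each workflow ID.
preproc <- list(
  basic    = base_rec,
  interact = interact_rec,
  splines  = spline_rec
)

# Three preprocessors crossed with one model specification gives
# three workflows, one per recipe.
lm_models <- workflow_set(preproc, list(lm = linear_reg()))
lm_models
```

Because each recipe extends the previous one, any difference in resampled performance between the three workflows can be attributed to the step that was added, which is the kind of within-model comparison this chapter describes.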

Publisher Resources

ISBN: 9781492096474