Tidy Modeling with R

by Max Kuhn, Julia Silge

July 2022

Beginner to intermediate

381 pages

9h 22m

English

O'Reilly Media, Inc.

Preface
Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Introduction
1. Software for Modeling
Fundamentals for Modeling Software; Types of Models; Descriptive Models; Inferential Models; Predictive Models; Connections Between Types of Models; Some Terminology; How Does Modeling Fit into the Data Analysis Process?; Chapter Summary
2. A Tidyverse Primer
Tidyverse Principles; Design for Humans; Reuse Existing Data Structures; Design for the Pipe and Functional Programming; Examples of Tidyverse Syntax; Chapter Summary
3. A Review of R Modeling Fundamentals
An Example; What Does the R Formula Do?; Why Tidiness Is Important for Modeling; Combining Base R Models and the Tidyverse; The tidymodels Metapackage; Chapter Summary
II. Modeling Basics
4. The Ames Housing Data
Exploring Features of Homes in Ames; Chapter Summary
5. Spending Our Data
Common Methods for Splitting Data; What About a Validation Set?; Multilevel Data; Other Considerations for a Data Budget; Chapter Summary
6. Fitting Models with parsnip
Create a Model; Use the Model Results; Make Predictions; parsnip-Extension Packages; Creating Model Specifications; Chapter Summary
7. A Model Workflow
Where Does the Model Begin and End?; Workflow Basics; Adding Raw Variables to the workflow(); How Does a workflow() Use the Formula?; Tree-Based Models; Special Formulas and Inline Functions; Creating Multiple Workflows at Once; Evaluating the Test Set; Chapter Summary
8. Feature Engineering with Recipes
A Simple recipe() for the Ames Housing Data; Using Recipes; How Data Are Used by the recipe(); Examples of Steps; Encoding Qualitative Data in a Numeric Format; Interaction Terms; Spline Functions; Feature Extraction; Row Sampling Steps; General Transformations; Natural Language Processing; Skipping Steps for New Data; Tidy a recipe(); Column Roles; Chapter Summary
9. Judging Model Effectiveness
Performance Metrics and Inference; Regression Metrics; Binary Classification Metrics; Multiclass Classification Metrics; Chapter Summary
III. Tools for Creating Effective Models
10. Resampling for Evaluating Performance
The Resubstitution Approach; Resampling Methods; Cross-Validation; Repeated Cross-Validation; Leave-One-Out Cross-Validation; Monte Carlo Cross-Validation; Validation Sets; Bootstrapping; Rolling Forecasting Origin Resampling; Estimating Performance; Parallel Processing; Saving the Resampled Objects; Chapter Summary
11. Comparing Models with Resampling
Creating Multiple Models with Workflow Sets; Comparing Resampled Performance Statistics; Simple Hypothesis Testing Methods; Bayesian Methods; A Random Intercept Model; The Effect of the Amount of Resampling; Chapter Summary
12. Model Tuning and the Dangers of Overfitting
Model Parameters; Tuning Parameters for Different Types of Models; What Do We Optimize?; The Consequences of Poor Parameter Estimates; Two General Strategies for Optimization; Tuning Parameters in tidymodels; Chapter Summary
13. Grid Search
Regular and Nonregular Grids; Regular Grids; Nonregular Grids; Evaluating the Grid; Finalizing the Model; Tools for Creating Tuning Specifications; Tools for Efficient Grid Search; Submodel Optimization; Parallel Processing; Benchmarking Boosted Trees; Access to Global Variables; Racing Methods; Chapter Summary
14. Iterative Search
A Support Vector Machine Model; Bayesian Optimization; A Gaussian Process Model; Acquisition Functions; The tune_bayes() Function; Simulated Annealing; Simulated Annealing Search Process; The tune_sim_anneal() Function; Chapter Summary
15. Screening Many Models
Modeling Concrete Mixture Strength; Creating the Workflow Set; Tuning and Evaluating the Models; Efficiently Screening Models; Finalizing a Model; Chapter Summary
IV. Beyond the Basics
16. Dimensionality Reduction
What Problems Can Dimensionality Reduction Solve?; A Picture Is Worth a Thousand…Beans; A Starter Recipe; Recipes in the Wild; Preparing a Recipe; Baking the Recipe; Feature Extraction Techniques; Principal Component Analysis; Partial Least Squares; Independent Component Analysis; Uniform Manifold Approximation and Projection; Modeling; Chapter Summary
17. Encoding Categorical Data
Is an Encoding Necessary?; Encoding Ordinal Predictors; Using the Outcome for Encoding Predictors; Effect Encodings in tidymodels; Effect Encodings with Partial Pooling; Feature Hashing; More Encoding Options; Chapter Summary
18. Explaining Models and Predictions
Software for Model Explanations; Local Explanations; Global Explanations; Building Global Explanations from Local Explanations; Back to Beans!; Chapter Summary
19. When Should You Trust Your Predictions?
Equivocal Results; Determining Model Applicability; Chapter Summary
20. Ensembles of Models
Creating the Training Set for Stacking; Blend the Predictions; Fit the Member Models; Test Set Results; Chapter Summary
21. Inferential Analysis
Inference for Count Data; Comparisons with Two-Sample Tests; Log-Linear Models; A More Complex Model; More Inferential Analysis; Chapter Summary
A. Recommended Preprocessing
References
Index
About the Authors

Content preview from Tidy Modeling with R

Chapter 8. Feature Engineering with Recipes

Feature engineering entails reformatting predictor values to make them easier for a model to use effectively. This includes transformations and encodings of the data to best represent their important characteristics. Imagine that you have two predictors in a data set that can be more effectively represented in your model as a ratio; creating a new predictor from the ratio of the original two is a simple example of feature engineering.
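
As a minimal sketch of the ratio example, the following uses base R only, not the recipes interface that the chapter goes on to cover. The data frame `homes` and its columns `living_area`, `lot_area`, and `living_to_lot` are invented for illustration and are not taken from the Ames data.

```r
# Hypothetical toy data; names and values are assumptions for illustration.
homes <- data.frame(
  living_area = c(1500, 2000, 900),
  lot_area    = c(6000, 8000, 4500)
)

# Engineer a new predictor as the ratio of the two original predictors.
homes$living_to_lot <- homes$living_area / homes$lot_area

print(homes)
```

Whether the ratio replaces the two original columns or sits alongside them is a modeling decision; the sketch keeps all three.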

Take the location of a house in Ames as a more involved example. There are a variety of ways that this spatial information can be exposed to a model, including neighborhood (a qualitative measure), longitude/latitude, distance to the nearest school, and so on. When choosing how to encode these data in modeling, we might choose an option we believe is most associated with the outcome. The original format of the data, for example numeric (e.g., distance) versus categorical (e.g., neighborhood), is also a driving factor in feature engineering choices.
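
To make the numeric versus categorical distinction concrete, here is a small base R sketch. The neighborhood labels, the `school_dist_km` column, and all values are made up, and `model.matrix()` is used only as one common way to expand a qualitative predictor into indicator columns; it is not presented as the approach the chapter itself takes.

```r
# Hypothetical toy data; neighborhood labels and distances are made up.
homes <- data.frame(
  neighborhood   = factor(c("North", "South", "North", "West")),
  school_dist_km = c(0.8, 2.5, 1.1, 4.0)
)

# A numeric predictor passes through unchanged, while a categorical one is
# expanded into indicator (dummy) columns. With the default treatment
# contrasts, the first factor level ("North") is the reference level and
# gets no column of its own.
encoded <- model.matrix(~ neighborhood + school_dist_km, data = homes)

print(colnames(encoded))
```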

Other examples of preprocessing to build better features for modeling include:

Correlation between predictors can be reduced via feature extraction or the removal of some predictors.
When some predictors have missing values, they can be imputed using a sub-model.
Models that use variance-type measures may benefit from coercing the distribution of some skewed predictors to be symmetric by estimating a transformation.
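
The three preprocessing ideas above can be sketched in base R as follows. All column names and values are hypothetical. The linear regression stands in for whatever sub-model one would actually choose for imputation, and the fixed `log10()` transform is a simplification of an estimated transformation such as Box-Cox.

```r
# Hypothetical toy data; all names and values are made up for illustration.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2.1, 3.9, 6.2, 8.1, NA, 12.0),  # roughly 2 * x1, one value missing
  x3 = c(1, 2, 4, 8, 16, 200)            # strongly right-skewed
)

# 1. Correlated predictors: measure the pairwise correlation before deciding
#    whether to drop one predictor or extract a combined feature.
r <- cor(d$x1, d$x2, use = "complete.obs")

# 2. Missing values: impute x2 from x1 with a small sub-model.
#    lm() drops the incomplete row by default when fitting.
fit <- lm(x2 ~ x1, data = d)
missing <- is.na(d$x2)
d$x2[missing] <- predict(fit, newdata = d[missing, , drop = FALSE])

# 3. Skewed predictor: a fixed log transform, used here in place of an
#    estimated transformation.
d$x3_log <- log10(d$x3)

print(d)
```

In practice these quantities (the imputation model, any transformation parameters) would be estimated from the training set only and then applied to new data.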

Feature engineering and data preprocessing ...

Publisher Resources

ISBN: 9781492096474
