
Tidy Modeling with R

by Max Kuhn, Julia Silge

July 2022

Beginner to intermediate

381 pages

9h 22m

English

O'Reilly Media, Inc.

Preface
  Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Introduction
1. Software for Modeling
  Fundamentals for Modeling Software; Types of Models; Descriptive Models; Inferential Models; Predictive Models; Connections Between Types of Models; Some Terminology; How Does Modeling Fit into the Data Analysis Process?; Chapter Summary
2. A Tidyverse Primer
  Tidyverse Principles; Design for Humans; Reuse Existing Data Structures; Design for the Pipe and Functional Programming; Examples of Tidyverse Syntax; Chapter Summary
3. A Review of R Modeling Fundamentals
  An Example; What Does the R Formula Do?; Why Tidiness Is Important for Modeling; Combining Base R Models and the Tidyverse; The tidymodels Metapackage; Chapter Summary
II. Modeling Basics
4. The Ames Housing Data
  Exploring Features of Homes in Ames; Chapter Summary
5. Spending Our Data
  Common Methods for Splitting Data; What About a Validation Set?; Multilevel Data; Other Considerations for a Data Budget; Chapter Summary
6. Fitting Models with parsnip
  Create a Model; Use the Model Results; Make Predictions; parsnip-Extension Packages; Creating Model Specifications; Chapter Summary
7. A Model Workflow
  Where Does the Model Begin and End?; Workflow Basics; Adding Raw Variables to the workflow(); How Does a workflow() Use the Formula?; Tree-Based Models; Special Formulas and Inline Functions; Creating Multiple Workflows at Once; Evaluating the Test Set; Chapter Summary
8. Feature Engineering with Recipes
  A Simple recipe() for the Ames Housing Data; Using Recipes; How Data Are Used by the recipe(); Examples of Steps; Encoding Qualitative Data in a Numeric Format; Interaction Terms; Spline Functions; Feature Extraction; Row Sampling Steps; General Transformations; Natural Language Processing; Skipping Steps for New Data; Tidy a recipe(); Column Roles; Chapter Summary
9. Judging Model Effectiveness
  Performance Metrics and Inference; Regression Metrics; Binary Classification Metrics; Multiclass Classification Metrics; Chapter Summary
III. Tools for Creating Effective Models
10. Resampling for Evaluating Performance
  The Resubstitution Approach; Resampling Methods; Cross-Validation; Repeated Cross-Validation; Leave-One-Out Cross-Validation; Monte Carlo Cross-Validation; Validation Sets; Bootstrapping; Rolling Forecasting Origin Resampling; Estimating Performance; Parallel Processing; Saving the Resampled Objects; Chapter Summary
11. Comparing Models with Resampling
  Creating Multiple Models with Workflow Sets; Comparing Resampled Performance Statistics; Simple Hypothesis Testing Methods; Bayesian Methods; A Random Intercept Model; The Effect of the Amount of Resampling; Chapter Summary
12. Model Tuning and the Dangers of Overfitting
  Model Parameters; Tuning Parameters for Different Types of Models; What Do We Optimize?; The Consequences of Poor Parameter Estimates; Two General Strategies for Optimization; Tuning Parameters in tidymodels; Chapter Summary
13. Grid Search
  Regular and Nonregular Grids; Regular Grids; Nonregular Grids; Evaluating the Grid; Finalizing the Model; Tools for Creating Tuning Specifications; Tools for Efficient Grid Search; Submodel Optimization; Parallel Processing; Benchmarking Boosted Trees; Access to Global Variables; Racing Methods; Chapter Summary
14. Iterative Search
  A Support Vector Machine Model; Bayesian Optimization; A Gaussian Process Model; Acquisition Functions; The tune_bayes() Function; Simulated Annealing; Simulated Annealing Search Process; The tune_sim_anneal() Function; Chapter Summary
15. Screening Many Models
  Modeling Concrete Mixture Strength; Creating the Workflow Set; Tuning and Evaluating the Models; Efficiently Screening Models; Finalizing a Model; Chapter Summary
IV. Beyond the Basics
16. Dimensionality Reduction
  What Problems Can Dimensionality Reduction Solve?; A Picture Is Worth a Thousand…Beans; A Starter Recipe; Recipes in the Wild; Preparing a Recipe; Baking the Recipe; Feature Extraction Techniques; Principal Component Analysis; Partial Least Squares; Independent Component Analysis; Uniform Manifold Approximation and Projection; Modeling; Chapter Summary
17. Encoding Categorical Data
  Is an Encoding Necessary?; Encoding Ordinal Predictors; Using the Outcome for Encoding Predictors; Effect Encodings in tidymodels; Effect Encodings with Partial Pooling; Feature Hashing; More Encoding Options; Chapter Summary
18. Explaining Models and Predictions
  Software for Model Explanations; Local Explanations; Global Explanations; Building Global Explanations from Local Explanations; Back to Beans!; Chapter Summary
19. When Should You Trust Your Predictions?
  Equivocal Results; Determining Model Applicability; Chapter Summary
20. Ensembles of Models
  Creating the Training Set for Stacking; Blend the Predictions; Fit the Member Models; Test Set Results; Chapter Summary
21. Inferential Analysis
  Inference for Count Data; Comparisons with Two-Sample Tests; Log-Linear Models; A More Complex Model; More Inferential Analysis; Chapter Summary
A. Recommended Preprocessing
References
Index
About the Authors

Content preview from Tidy Modeling with R

Chapter 17. Encoding Categorical Data

For statistical modeling in R, the preferred representation for categorical or nominal data is a factor, a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels.¹ In Chapter 8 we introduced feature engineering approaches, including those to encode or transform qualitative or nominal data into a representation better suited for most model algorithms. We discussed how to transform a categorical variable, such as the Bldg_Type in our Ames housing data (with levels OneFam, TwoFmCon, Duplex, Twnhs, and TwnhsE), to a set of dummy or indicator variables like those shown in Table 17-1.

Table 17-1. Illustration of binary encodings (i.e., dummy variables) for a qualitative predictor
Raw data	TwoFmCon	Duplex	Twnhs	TwnhsE
OneFam	0	0	0	0
TwoFmCon	1	0	0	0
Duplex	0	1	0	0
Twnhs	0	0	1	0
TwnhsE	0	0	0	1

Many model implementations require such a transformation to a numeric representation for categorical data.
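
As a minimal sketch of the encoding in Table 17-1, the base R code below builds a factor with the five Bldg_Type levels, shows its integer storage, and expands it into indicator columns with model.matrix(). The object names bldg_type and dummies are illustrative assumptions, and this uses base R rather than the recipes steps introduced in Chapter 8.

```r
# One observation per level, with OneFam as the first (reference) level.
bldg_levels <- c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE")
bldg_type <- factor(bldg_levels, levels = bldg_levels)

# A factor is stored as integer codes plus a set of text labels.
as.integer(bldg_type)
levels(bldg_type)

# model.matrix() applies the default treatment contrasts; dropping the
# intercept column leaves one indicator column per non-reference level.
dummies <- model.matrix(~ bldg_type)[, -1]

# Strip the variable-name prefix so the column names match the level names.
colnames(dummies) <- sub("^bldg_type", "", colnames(dummies))
rownames(dummies) <- as.character(bldg_type)
dummies
```

The reference level (OneFam here) is the row of all zeros, so five levels need only four indicator columns.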

Note

The Appendix presents a table of recommended preprocessing techniques for different models; notice how many of the models in the table require a numeric encoding for all predictors.

However, for some realistic data sets, straightforward dummy variables are not a good fit. This often happens because there are too many categories or there are new categories at prediction time. In this chapter, we discuss more sophisticated options ...
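
The two failure modes named above can be illustrated with a short base R sketch. The level "Triplex" and the 200 synthetic category labels are hypothetical values invented for this example; they are not part of the Ames data.

```r
# Levels seen when the model was trained.
train_levels <- c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE")

# New category at prediction time: a value outside the training levels
# has no integer code, so it becomes NA when encoded with those levels.
new_values <- c("Duplex", "Triplex")  # "Triplex" is a hypothetical unseen level
encoded <- factor(new_values, levels = train_levels)
is.na(encoded)

# Too many categories: a predictor with k levels expands to k - 1
# indicator columns, so the design matrix widens with every new level.
many <- factor(sprintf("cat_%03d", 1:200))  # 200 synthetic levels
n_dummy_cols <- ncol(model.matrix(~ many)) - 1  # subtract the intercept column
n_dummy_cols
```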

ISBN: 9781492096474
