8 Advanced data preparation

This chapter covers

  • Using the vtreat package for advanced data preparation
  • Cross-validated data preparation

In the last chapter, we built substantial models on well-behaved data. In this chapter, we will learn how to prepare, or treat, messy real-world data for modeling. We will use the principles of chapter 4 and the vtreat package for advanced data preparation. We will revisit the issues that arise with missing values, categorical variables, recoding variables, redundant variables, and having too many variables. We will also spend some time on variable selection, which remains an important step even with current machine learning methods. The mental model summary (figure 8.1) of this chapter emphasizes that this chapter ...

