Chapter 4. Managing data

This chapter covers

  • Fixing data quality problems
  • Organizing your data for the modeling process

In chapter 3, you learned how to explore your data and identify common data issues. In this chapter, you’ll see how to fix the issues you’ve discovered. After that, we’ll talk about organizing the data for the modeling process.[1]

1 For all of the examples in this chapter, we’ll use synthetic customer data (mostly derived from US Census data) with deliberately introduced flaws. To load the data, save the file exampleData.rData from https://github.com/WinVector/zmPDSwR/tree/master/Custdata into your R working directory and then run load("exampleData.rData") in R.
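
The loading step in the footnote can be sketched as follows. This is a minimal sketch, assuming exampleData.rData has already been saved into the current working directory; the helper name load_example_data is hypothetical and not part of the book's code. It relies on the fact that load() returns the names of the objects it restores, so you can see what the file contains without guessing.

```r
# Hypothetical helper (not from the book): load an .rData file if it exists
# and report the names of the objects it restored.
load_example_data <- function(path = "exampleData.rData") {
  if (!file.exists(path)) {
    message("File not found: ", path,
            " (save it into the working directory first)")
    return(character(0))
  }
  # load() restores the saved objects into the given environment and
  # returns their names as a character vector.
  restored <- load(path, envir = globalenv())
  message("Loaded objects: ", paste(restored, collapse = ", "))
  restored
}

# Assumes the file was downloaded as described in the footnote.
object_names <- load_example_data("exampleData.rData")
```

If the file is missing, the helper returns an empty character vector instead of raising an error, which makes it easy to check the result before continuing with the chapter's examples.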

4.1. Cleaning data

In this section, we’ll address issues ...
