Chapter 4. Managing data

This chapter covers

  • Fixing data quality problems
  • Organizing your data for the modeling process

In chapter 3, you learned how to explore your data and to identify common data issues. In this chapter, you’ll see how to fix the data issues that you’ve discovered. After that, we’ll talk about organizing the data for the modeling process.[1]

[1] For all of the examples in this chapter, we’ll use synthetic customer data (mostly derived from US Census data) with specifically introduced flaws. The data can be loaded by saving the file exampleData.rData from https://github.com/WinVector/zmPDSwR/tree/master/Custdata and then running load("exampleData.rData") in R.
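
The loading step described in the footnote can be sketched as follows. This is a minimal illustration, not the book's own listing: the helper name load_and_report is hypothetical, and the code assumes exampleData.rData has already been saved into the current working directory. It relies on the fact that load() returns the names of the objects it creates, so you can see what the file contains without knowing the object names in advance.

```r
# Hypothetical helper: load an .rData file and report which objects it created.
load_and_report <- function(path, envir = globalenv()) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (save it from the repository first)")
  }
  # load() returns a character vector naming the objects it restored.
  object_names <- load(path, envir = envir)
  for (name in object_names) {
    cat(name, ":", class(get(name, envir = envir))[1], "\n")
  }
  invisible(object_names)
}

# Assumes the file was saved to the working directory; does nothing otherwise.
if (file.exists("exampleData.rData")) {
  load_and_report("exampleData.rData")
}
```

Printing the restored object names first is a small safeguard: load() silently overwrites any existing objects with the same names, so it helps to know what was brought into the session.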

4.1. Cleaning data

In this section, we’ll address issues ...
