Chapter 8 Extended Exercise

In this chapter, we set up a guided data cleaning task, from beginning to end. This data (including the company and personal names and addresses) is entirely fabricated and is intended only to demonstrate some of the concepts in this book – but every quirk you see in this data is based on actual data we encountered doing real projects. Unlike the smaller examples in earlier chapters, these data sets are large enough that you cannot spot all of their anomalies by eye. However, you will be able to open and examine these data sets outside of R – unlike some of the data sets we deal with in real life.

This exercise requires time and focus to complete. You will get the most benefit from this book if you read the chapter all the way through and try to perform all the tasks in the exercise. Part of the exercise is figuring out exactly what needs to be done at each step and in which order. We have included some “pseudocode” – high-level descriptions of the algorithms we used to perform the tasks – and some hints in Appendix A. However, we recommend that you consult them only when you find yourself stuck. The actual code we used to do the cleaning is available in the cleaningBook package. Again, you will receive the most benefit when you try to solve the problem in its entirety before looking at our code – and you may find a different, or better, way of getting the job done.

8.1 Introduction to the Problem

This data comes from a hypothetical client company called Hardy Business ...
