Appendix A

Hints and Pseudocode

Chapter 8 described a data-handling exercise that involves acquiring data from spreadsheets, a database, JSON, XML, and fixed-width text files. The formats and layouts of the data are documented in that chapter. In this appendix, we give some extra hints about how to proceed. We recommend attempting the exercise first and turning to this appendix only when you need to.

Some of these hints come in the form of “pseudocode.” This is the programmer's term for instructions that describe an algorithm in ordinary English, rather than in the strict form of code in R or another computer language. The actual R code we used can be found in the cleaningBook package.

A.1 Loan Portfolios

Reading, cleaning, and combining the loan portfolios (Section 8.4) is the first task in the exercise, and perhaps the most time-consuming. However, none of the steps needed to complete this task is particularly challenging from a technical standpoint.

If you have a spreadsheet program such as Excel that can open the file, the very first step in this process might be to use that program to view the file. Look at the values. Are there headers? Can you see any missing value codes? Are some columns empty? Do any rows appear to be missing a lot of values? Are values unexpectedly duplicated across rows? Are there dates, currency amounts, or other values that might require special handling?

Then, it is time to read the two data sets into R and produce data frames, using one of the read.table() ...
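As a rough illustration of reading a file and then asking the questions above from within R, here is a minimal sketch. It is not the code from the cleaningBook package. The data, the column names (LoanID, Amount, OpenDate, Notes), and the "N/A" missing-value code are all invented for the example, and read.csv() is used simply as one member of the read.table() family.

```r
# Hypothetical stand-in for a loan portfolio saved as delimited text.
# Every name and value here is invented for illustration.
portfolio_text <- 'LoanID,Amount,OpenDate,Notes
1001,"$1,250.00",2015-03-14,
1002,N/A,2015-04-02,
1002,N/A,2015-04-02,
1003,$980.50,2015-05-20,'

# Read the text as if it were a file. We assume a header row and treat
# both "N/A" and empty fields as missing.
loans <- read.csv(text = portfolio_text,
                  header = TRUE,
                  na.strings = c("N/A", ""),
                  stringsAsFactors = FALSE)

# What did R make of each column?
str(loans)

# How many values are missing in each column?
colSums(is.na(loans))

# Which columns are entirely empty?
empty_cols <- names(loans)[colSums(!is.na(loans)) == 0]

# How many rows exactly repeat an earlier row?
n_dup <- sum(duplicated(loans))

# Currency amounts arrive as text: strip "$" and "," before converting.
loans$Amount <- as.numeric(gsub("[$,]", "", loans$Amount))

# Dates also arrive as text: convert using an explicit format.
loans$OpenDate <- as.Date(loans$OpenDate, format = "%Y-%m-%d")
```

With a real file, the text argument would be replaced by the file's path, and the na.strings and format values would need to match whatever the inspection of that file revealed.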
