Chapter 20

Ten (or So) Best Practices in Data Preparation

In This Chapter

Understanding the key steps in data validation

Preparing data for analysis

The main goal of this book is to familiarize you with the statistical methods that let you build useful statistical models. But as you’ve probably noticed, we have spent a great deal of time, particularly in Part II, talking about getting data ready for analysis. Statistical software packages are extremely powerful these days, but they cannot overcome poor-quality data. This chapter provides a checklist of things to do before you start building statistical models.

Check Data Formats

Your analysis always starts with a raw data file. Raw data files come in many different shapes and sizes. Mainframe data is different from PC data, spreadsheet data is formatted differently from web data, and so forth. And in the age of big data, you will surely be faced with data from a variety of sources. Your first step in analyzing your data is to make sure you can read the files you’re given. Chapter 7 gives some tips about how to do this.
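As a rough illustration of that first step, the sketch below tries to decode a delimited text file and guess its delimiter before any analysis begins. It uses only the Python standard library. The function name, the list of candidate encodings, and the sample data are assumptions made for this example, not something taken from the book.

```python
import csv
import io


def check_readable(raw_bytes, encodings=("utf-8", "latin-1")):
    """Try to decode raw bytes and guess the delimiter of a delimited text file.

    The candidate encodings are an assumption for this sketch; real files
    may need others. Note that latin-1 accepts any byte sequence, so it
    acts as a last-resort fallback rather than a real check.
    """
    for encoding in encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next one
        # csv.Sniffer guesses the delimiter from the sample text and
        # raises csv.Error if it cannot decide.
        dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
        rows = list(csv.reader(io.StringIO(text), dialect))
        return {
            "encoding": encoding,
            "delimiter": dialect.delimiter,
            "header": rows[0],
            "n_data_rows": len(rows) - 1,
        }
    raise ValueError("could not decode the data with any candidate encoding")


# Hypothetical sample standing in for the first bytes of a raw data file.
sample = b"id;name;amount\n1;Ann;10.5\n2;Bob;7.25\n"
report = check_readable(sample)
print(report)
```

In practice you would pass in the first few kilobytes of the actual file, read in binary mode, and compare the reported header against whatever documentation came with the data.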

Chapter 6 talks about the formats of the individual data fields, or variables, in your data file. You need to actually look at what each field contains. For example, it’s not wise to trust that ...
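One way to look at what each field actually contains is to count, per field, how many values are blank, how many parse as numbers, and how many are other text. The following is a minimal sketch under that assumption, using only the Python standard library. The field names and sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def profile_fields(rows):
    """Count blank, numeric, and text values for each field.

    `rows` is a list of dicts, such as the output of csv.DictReader.
    """
    profile = {}
    for row in rows:
        for field, value in row.items():
            stats = profile.setdefault(field, Counter())
            value = (value or "").strip()
            if value == "":
                stats["blank"] += 1
                continue
            try:
                float(value)  # does the value parse as a number?
                stats["numeric"] += 1
            except ValueError:
                stats["text"] += 1
    return profile


# Hypothetical data: a postal code field that mixes digits, blanks, and text.
sample = "id,zip,amount\n1,02134,10.5\n2,,n/a\n3,K1A 0B1,7\n"
rows = list(csv.DictReader(io.StringIO(sample)))
profile = profile_fields(rows)
for field, stats in profile.items():
    print(field, dict(stats))
```

A field that shows a mix of numeric and text counts, or an unexpected number of blanks, is a signal to inspect the raw values before using it in a model.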
