Chapter Three
Pitfall 2: Technical Trespasses
“All I have to do is work on transition and technique.”
—Usain Bolt
How We Process Data
Now that we've had a chance to clear the air about some important philosophical concepts, let's dive into a highly technical part of the data working process that typically happens at the very beginning. Some call it data wrangling; some call it data munging. It's the not-so-glamorous process of getting your data into the proper condition and shape to do the analysis in the first place.
If we compare the data working process to building a house, these data preparation steps are kind of like laying the foundation and installing the plumbing and the electrical. When it's all said and done, you don't really see any of those things, but if the builders got them wrong, you're sure not going to want to live there. And working on these parts of the house after people have moved in only gets messier and more difficult.
But this part of the process isn't just critical for the rest of the endeavor; it also typically takes the bulk of the time. An oft-cited figure is that cleaning, structuring, and preparing your data for analysis can account for as much as 50% to 80% of the overall time of the data working project.1
So then, identifying and learning to avoid the pitfalls in these critical, time-consuming, and, let's be honest, tedious steps in the process is really important to our success.
Let's start by accepting a few fundamental principles of data ...