Chapter 6. Efficient Data Carpentry
There are many words for data processing. You can clean, hack, manipulate, munge, refine, and tidy your dataset, ready for the next stage. Each word says something about how people perceive the process: data processing is often seen as dirty work, an unpleasant necessity that must be endured before the real fun and important work begins. This perception is wrong. Getting your data shipshape is a respectable and, in some cases, vital skill. For this reason, we use the more admirable term data carpentry.
This metaphor is not accidental. Carpentry is the process of taking rough pieces of wood and working with care, diligence, and precision to create a finished product. A carpenter does not hack at the wood at random. He or she will inspect the raw material and select the right tool for the job. In the same way, data carpentry is the process of taking rough, raw, and to some extent randomly arranged input data and creating neatly organized and tidy data. Learning the skill of data carpentry early will yield benefits for years to come. “Give me six hours to chop down a tree and I will spend the first four sharpening the axe,” as the saying goes.
Data processing is a critical stage in any project involving datasets from external sources, which describes most real-world applications. In the same way that technical debt, discussed in Chapter 5, can cripple your workflow, working with messy data can lead to project management hell.