Principles of Data Wrangling
by Joseph M. Hellerstein, Tye Rattenbury, Jeffrey Heer, Sean Kandel, Connor Carreras
Chapter 3. The Dynamics of Data Wrangling
In Chapter 2, we introduced a framework capturing the variety of actions involved in working with data. Each of these actions involves some amount of data wrangling. In this chapter, we describe the dynamics of data wrangling, the breadth of transformations and profiling required to wrangle data, and how these aspects of data wrangling vary by action in our framework.
Data Wrangling Dynamics
Data wrangling is a generic phrase capturing the range of tasks involved in preparing your data for analysis. Data wrangling begins with accessing your data. Sometimes, access is gated on getting appropriate permission and making the corresponding changes in your data infrastructure. Access also involves managing where datasets are located and how they relate to one another. This kind of data wrangling covers everything from moving datasets around a folder hierarchy, to replicating datasets across warehouses for easier access, to analyzing differences between similar datasets and assessing their overlaps and conflicts.
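As a rough illustration of assessing overlaps and conflicts between similar datasets, the following minimal Python sketch compares two small record sets on a shared key. The function name `compare_datasets`, the sample records, and the `id` key are hypothetical and chosen only for illustration. The sketch assumes each key value appears at most once per dataset.

```python
def compare_datasets(left, right, key):
    """Compare two lists of dict records that share a key field.

    Assumes (for illustration) that key values are unique within each list.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Keys present in both datasets form the overlap.
    shared = left_by_key.keys() & right_by_key.keys()

    # A conflict is a shared key whose common fields hold different values.
    conflicts = {}
    for k in shared:
        a, b = left_by_key[k], right_by_key[k]
        differing = sorted(f for f in a.keys() & b.keys() if a[f] != b[f])
        if differing:
            conflicts[k] = differing

    return {
        "only_left": sorted(left_by_key.keys() - shared),
        "only_right": sorted(right_by_key.keys() - shared),
        "overlap": sorted(shared),
        "conflicts": conflicts,
    }


# Hypothetical copies of the "same" dataset held in two warehouses.
warehouse_a = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": "Boston"},
    {"id": 3, "city": "Chicago"},
]
warehouse_b = [
    {"id": 2, "city": "Boston"},
    {"id": 3, "city": "Denver"},
    {"id": 4, "city": "El Paso"},
]

report = compare_datasets(warehouse_a, warehouse_b, key="id")
print(report)
```

A report of this kind separates records that exist in only one copy from records that exist in both but disagree, which are usually the ones that need a decision before analysis can proceed.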
After you have successfully accessed your data, the bulk of your data wrangling work will involve transforming the data itself: manipulating the structure, granularity, accuracy, temporality, and scope of your data to better align with your analysis goals. All of these transformations are best performed with tools that provide meaningful feedback, so that the person making the changes can confirm that each manipulation succeeded. We refer to this feedback ...