Chapter 8. Data Wrangling Service

With the data now aggregated in the lake, we are ready to focus on wrangling it, which typically includes structuring, cleaning, enriching, and validating the data. Wrangling is an iterative process of addressing errors, outliers, missing values, value imputation, data imbalance, and data encoding. Each step in the process exposes new ways the data might need to be “re-wrangled,” with the goal of producing the most robust data values for generating insights. Wrangling also provides insight into the nature of the data, allowing us to ask better questions of it.
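To make these steps concrete, the following is a minimal sketch of one wrangling pass, assuming pandas is available; the dataset, column names, and thresholds are hypothetical and purely illustrative, not part of any particular pipeline:

```python
import pandas as pd

# Hypothetical dataset; column names are illustrative.
df = pd.read_csv("customers.csv")

# Cleaning: drop exact duplicate records.
df = df.drop_duplicates()

# Missing values: impute a numeric column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: clip income to the 1st-99th percentile range.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# Encoding: one-hot encode a categorical column.
df = pd.get_dummies(df, columns=["plan_type"])

# Validation: assert basic invariants before handing data downstream.
assert df["age"].between(0, 120).all(), "age out of valid range"
assert not df.isna().any().any(), "unexpected missing values remain"
```

In practice, each of these steps is revisited as new issues surface, which is what makes the process iterative.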

Data scientists spend a significant amount of time and manual effort on wrangling (as shown in Figure 8-1). In addition to being time-consuming, wrangling is often incomplete, unreliable, and error prone, and it comes with several pain points. First, data users touch a large number of datasets during exploratory analysis, so it is critical to quickly discover the properties of the data and detect the wrangling transformations required to prepare it. Currently, evaluating dataset properties and determining the wrangling to be applied is ad hoc and manual. Second, applying wrangling transformations requires writing idiosyncratic scripts in programming languages like Python, Perl, and R, or engaging in tedious manual editing using tools like Microsoft Excel. Given the growing volume, velocity, and variety of the data, data users require low-level coding ...
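As a point of reference for the first pain point, the ad hoc, manual evaluation of dataset properties often amounts to a handful of exploratory calls like the sketch below, assuming pandas and a hypothetical events dataset with a hypothetical label column:

```python
import pandas as pd

df = pd.read_csv("events.csv")  # hypothetical dataset

# Manually inspect basic properties: schema, summary stats, missingness.
print(df.dtypes)                                       # column types
print(df.describe(include="all"))                      # distribution summaries
print(df.isna().mean().sort_values(ascending=False))   # fraction missing per column

# Check class balance for a hypothetical label column.
print(df["label"].value_counts(normalize=True))

# Spot candidate outliers: values beyond 3 standard deviations.
numeric = df.select_dtypes("number")
print(((numeric - numeric.mean()).abs() > 3 * numeric.std()).sum())
```

Repeating this by hand for every dataset touched during exploration is exactly the kind of manual effort the pain point describes.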
