1 Basic Principles of Data Wrangling

Akshay Singh*, Surender Singh and Jyotsna Rathee

Department of Information Technology, Maharaja Surajmal Institute of Technology, Janakpuri, New Delhi, India

Abstract

Data wrangling is considered a crucial step of the data science lifecycle. The quality of data analysis depends directly on the quality of the data itself. As the number of data sources grows at a fast pace, it is essential to organize the data for analysis. Data wrangling is the process of cleaning, structuring, and enriching raw data into the required format in order to make better judgments in less time. It entails the manual conversion and mapping of data from one raw form to another in order to facilitate data consumption and organization. It is also known as data munging, which refers to making data “digestible.” The data wrangling pipeline is an iterative process of gathering, filtering, converting, exploring, and integrating data. The foundation of data wrangling is data gathering. The data is extracted, parsed, and scraped before unnecessary information is removed from it. Data filtering, or scrubbing, removes corrupt and invalid data and keeps only the data that is needed. The data is transformed from an unstructured form into a more structured one. Then, the data is converted from one format to another. Some common formats are CSV, JSON, XML, and SQL. Preliminary analysis of the data is carried out in the data exploration step. ...
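The gathering, filtering, and converting steps named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library; the sample records, the field names (`id`, `name`, `age`), the validity rule, and the helper functions `gather`, `is_valid`, and `to_json` are all hypothetical and chosen only for illustration, not taken from the chapter.

```python
import csv
import io
import json

# Hypothetical raw input: a small CSV with one corrupt row and one incomplete row.
RAW_CSV = """id,name,age
1,Asha,34
2,Ravi,not_a_number
3,,29
4,Meera,41
"""


def gather(text):
    """Gathering: parse raw CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def is_valid(record):
    """Filtering: keep rows with a non-empty name and an all-digit age (assumed rule)."""
    return bool(record["name"].strip()) and record["age"].strip().isdigit()


def to_json(records):
    """Converting: turn cleaned CSV rows into a typed JSON string."""
    typed = [
        {"id": int(r["id"]), "name": r["name"], "age": int(r["age"])}
        for r in records
    ]
    return json.dumps(typed, indent=2)


records = gather(RAW_CSV)                      # all parsed rows
clean = [r for r in records if is_valid(r)]    # rows that pass the filter
result = to_json(clean)                        # CSV rows re-expressed as JSON
print(result)
```

In this sketch, the rows with a non-numeric age and with a missing name are dropped during filtering, and the remaining rows are written out as JSON. Real pipelines would typically apply richer validation rules and also cover the exploration and integration steps.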
