Chapter 3. Data Wrangling
3.0 Introduction
Data wrangling is a broad term used, often informally, to describe the process of transforming raw data into a clean and organized format ready for use. For us, data wrangling is only one step in preprocessing our data, but it is an important step.
The most common data structure used to “wrangle” data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet. Here is a data frame created from data about passengers on the Titanic:
    # Load library
    import pandas as pd

    # Create URL
    url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/titanic.csv'

    # Load data as a dataframe
    dataframe = pd.read_csv(url)

    # Show first 5 rows
    dataframe.head(5)
| | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.00 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.00 | female | 0 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.00 | male | 0 | 0 |
| 3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25.00 | female | 0 | 1 |
| 4 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |
There are three important things to notice in this data frame.
First, in a data frame each row corresponds to one observation (e.g., a passenger) and each column corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, ...
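The row-as-observation and column-as-feature layout can be checked directly in code. The sketch below is a minimal illustration: it types in the five rows shown in the table by hand so that it runs without downloading the CSV. The variable names `passengers`, `first_passenger`, and `ages` are chosen here for illustration and do not come from the original example.

```python
import pandas as pd

# Hand-typed copy of the five rows displayed in the table, so the example
# runs offline (the full dataset has many more rows)
passengers = pd.DataFrame({
    "Name": [
        "Allen, Miss Elisabeth Walton",
        "Allison, Miss Helen Loraine",
        "Allison, Mr Hudson Joshua Creighton",
        "Allison, Mrs Hudson JC (Bessie Waldo Daniels)",
        "Allison, Master Hudson Trevor",
    ],
    "PClass": ["1st", "1st", "1st", "1st", "1st"],
    "Age": [29.00, 2.00, 30.00, 25.00, 0.92],
    "Sex": ["female", "female", "male", "female", "male"],
    "Survived": [1, 0, 0, 0, 1],
    "SexCode": [1, 1, 0, 1, 0],
})

# One row is one observation: select the first passenger by position
first_passenger = passengers.iloc[0]
print(first_passenger)

# One column is one feature: select the ages of all passengers
ages = passengers["Age"]
print(ages)

# shape reports (number of observations, number of features)
n_rows, n_cols = passengers.shape
print(n_rows, n_cols)
```

Selecting by position with `iloc` returns a single observation with one value per feature, while selecting by column name returns a single feature with one value per observation.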