Chapter 3. Data Wrangling
3.0 Introduction
Data wrangling is a broad term used, often informally, to describe the process of transforming raw data into a clean, organized format ready for use. For us, data wrangling is only one step in preprocessing our data, but it is an important step.
The most common data structure used to “wrangle” data is the dataframe, which can be both intuitive and incredibly versatile. Dataframes are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet. Here is a dataframe created from data about passengers on the Titanic:
    # Load library
    import pandas as pd

    # Create URL
    url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/titanic.csv'

    # Load data as a dataframe
    dataframe = pd.read_csv(url)

    # Show first five rows
    dataframe.head(5)
| | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.00 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.00 | female | 0 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.00 | male | 0 | 0 |
| 3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25.00 | female | 0 | 1 |
| 4 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |
There are three important things to notice in this dataframe.
First, in a dataframe each row corresponds to one observation (e.g., a passenger) and each column corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, was ...
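The row-and-column structure described above can be sketched in a few lines. This is a minimal illustration, not part of the original recipe: it rebuilds the five rows shown in the table by hand (so it runs without downloading the CSV), and the variable names `first_observation`, `age_feature`, `n_observations`, and `n_features` are chosen here purely for illustration.

```python
# Load library
import pandas as pd

# Rebuild the five rows shown in the table above by hand
# (an assumption made so the sketch runs offline; the chapter loads the full CSV)
dataframe = pd.DataFrame({
    "Name": [
        "Allen, Miss Elisabeth Walton",
        "Allison, Miss Helen Loraine",
        "Allison, Mr Hudson Joshua Creighton",
        "Allison, Mrs Hudson JC (Bessie Waldo Daniels)",
        "Allison, Master Hudson Trevor",
    ],
    "PClass": ["1st", "1st", "1st", "1st", "1st"],
    "Age": [29.00, 2.00, 30.00, 25.00, 0.92],
    "Sex": ["female", "female", "male", "female", "male"],
    "Survived": [1, 0, 0, 0, 1],
    "SexCode": [1, 1, 0, 1, 0],
})

# One row is one observation: select the first passenger by position
first_observation = dataframe.iloc[0]

# One column is one feature: select every passenger's age
age_feature = dataframe["Age"]

# shape is (number of observations, number of features)
n_observations, n_features = dataframe.shape

print(first_observation["Name"], first_observation["Age"])
print(n_observations, n_features)
```

Selecting by position with `iloc` returns a single observation as a Series whose labels are the column names, while selecting by column name returns that feature for every observation.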