Chapter 3. Data Wrangling
3.0 Introduction
Data wrangling is a broad term used, often informally, to describe the process of transforming raw data into a clean and organized format ready for use. For us, data wrangling is only one step in preprocessing our data, but it is an important one.
The most common data structure used to “wrangle” data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet. Here is a data frame created from data about passengers on the Titanic:
# Load library
import pandas as pd

# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/titanic.csv'

# Load data as a dataframe
dataframe = pd.read_csv(url)

# Show first 5 rows
dataframe.head(5)
|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.00 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.00 | female | 0 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.00 | male | 0 | 0 |
| 3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25.00 | female | 0 | 1 |
| 4 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |
There are three important things to notice in this data frame.
First, in a data frame each row corresponds to one observation (e.g., a passenger) and each column corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, ...
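The row-as-observation, column-as-feature layout described above can be checked directly in pandas. The following is a minimal sketch, not part of the original recipe: it assumes the five rows printed earlier and rebuilds them by hand so that it runs without a network connection, and the variable names `first_passenger` and `ages` are purely illustrative.

```python
# Load library
import pandas as pd

# Rebuild the five rows shown earlier by hand (an assumption made so the
# sketch runs offline; the original loads the full CSV from a URL)
dataframe = pd.DataFrame({
    "Name": [
        "Allen, Miss Elisabeth Walton",
        "Allison, Miss Helen Loraine",
        "Allison, Mr Hudson Joshua Creighton",
        "Allison, Mrs Hudson JC (Bessie Waldo Daniels)",
        "Allison, Master Hudson Trevor",
    ],
    "PClass": ["1st", "1st", "1st", "1st", "1st"],
    "Age": [29.00, 2.00, 30.00, 25.00, 0.92],
    "Sex": ["female", "female", "male", "female", "male"],
    "Survived": [1, 0, 0, 0, 1],
    "SexCode": [1, 1, 0, 1, 0],
})

# One row is one observation: select the first passenger by position
first_passenger = dataframe.iloc[0]
print(first_passenger["Name"], first_passenger["PClass"], first_passenger["Age"])

# One column is one feature: select every passenger's age
ages = dataframe["Age"]
print(ages.tolist())
```

Selecting a row with `iloc` returns a Series indexed by the column names, while selecting a column returns a Series indexed by the row labels, which mirrors the observation/feature split in the text.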