Learning Data Science

by Sam Lau, Joseph Gonzalez, Deborah Nolan
September 2023
Beginner
596 pages
15h 31m
English
O'Reilly Media, Inc.

Chapter 9. Wrangling Dataframes

We often need to perform preparatory work on our data before we can begin our analysis. The amount of preparation can vary widely, but there are a few basic steps to move from raw data to data ready for analysis. Chapter 8 addressed the initial steps of creating a dataframe from a plain-text source. In this chapter, we assess quality. To do this, we perform validity checks on individual data values and entire columns. In addition to checking the quality of the data, we determine whether or not the data need to be transformed and reshaped to get ready for analysis. Quality checking (and fixing) and transformation are often cyclical: the quality checks point us toward transformations we need to make, and when we check the transformed columns to confirm that our data are ready for analysis, we may discover they need further cleaning.
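As a rough illustration of this check-transform-recheck cycle, here is a minimal sketch in pandas. The dataframe, its column names (`age`, `state`), and the plausible age range of 0 to 120 are all invented for this example and are not taken from the chapter.

```python
import pandas as pd

# Hypothetical data: the columns and values are made up for illustration.
df = pd.DataFrame({
    "age": [34, 27, -1, 45, 999],
    "state": ["CA", "ca", "NY", None, "TX"],
})

# Value-level check: flag individual ages outside an assumed plausible range.
bad_age = ~df["age"].between(0, 120)

# Column-level checks: count missing values per column and
# collect the distinct categories that appear in a column.
missing_per_column = df.isna().sum()
state_values = set(df["state"].dropna().unique())

# Transformations suggested by the checks: standardize the case of
# state codes and treat implausible ages as missing.
clean = df.assign(
    state=df["state"].str.upper(),
    age=df["age"].mask(bad_age),
)

# Recheck the transformed column to confirm no implausible ages remain.
remaining_bad = int((~clean["age"].dropna().between(0, 120)).sum())

print(bad_age.sum(), "implausible ages found;", remaining_bad, "remain after cleaning")
```

Replacing implausible values with missing values is only one option. Whether to drop, correct, or keep such records depends on how the data were collected.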

Our expectations for quality often depend on the data source. Some datasets require extensive wrangling to get them into an analyzable form, while others arrive clean enough that we can quickly launch into modeling. Here are some examples of data sources and how much wrangling we might expect to do:

  • Data from a scientific experiment or study are typically clean, are well documented, and have a simple structure. These data are organized to be broadly shared so that others can build on or reproduce the findings. They are typically ready for analysis after little to no wrangling.

  • Data from government surveys often ...

ISBN: 9781098112998