Python Data Cleaning Cookbook

Name: Python Data Cleaning Cookbook
Author: Michael Walker
ISBN: 9781800565661

by Michael Walker

December 2020

Beginner to intermediate

436 pages

8h 23m

English

Packt Publishing

Python Data Cleaning Cookbook
  Why subscribe?
  Contributors
  About the author
  About the reviewers
  Packt is searching for authors like you
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
  Download the example code files
  Download the color images
  Conventions used
  Sections: Getting ready; How to do it…; How it works…; There's more…; See also
  Get in touch
  Reviews
Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas
  Technical requirements
  Importing CSV files: Getting ready; How to do it…; How it works…; There's more…; See also
  Importing Excel files: Getting ready; How to do it…; How it works…; There's more…; See also
  Importing data from SQL databases: Getting ready; How to do it…; How it works…; There's more…; See also
  Importing SPSS, Stata, and SAS data: Getting ready; How to do it…; How it works…; There's more…; See also
  Importing R data: Getting ready; How to do it…; How it works…; There's more…; See also
  Persisting tabular data: Getting ready; How to do it…; How it works…; There's more…
Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas
  Technical requirements
  Importing simple JSON data: Getting ready; How to do it…; How it works…; There's more…
  Importing more complicated JSON data from an API: Getting ready; How to do it…; How it works…; There's more…; See also
  Importing data from web pages: Getting ready; How to do it…; How it works…; There's more…
  Persisting JSON data: Getting ready; How to do it…; How it works…; There's more…
Chapter 3: Taking the Measure of Your Data
  Technical requirements
  Getting a first look at your data: Getting ready…; How to do it…; How it works…; There's more…; See also
  Selecting and organizing columns: Getting ready…; How to do it…; How it works…; There's more…; See also
  Selecting rows: Getting ready…; How to do it…; How it works…; There's more…; See also
  Generating frequencies for categorical variables: Getting ready…; How to do it…; How it works…; There's more…
  Generating summary statistics for continuous variables: Getting ready…; How to do it…; How it works…; See also
Chapter 4: Identifying Missing Values and Outliers in Subsets of Data
  Technical requirements
  Finding missing values: Getting ready; How to do it…; How it works…; See also
  Identifying outliers with one variable: Getting ready; How to do it…; How it works…; There's more…; See also
  Identifying outliers and unexpected values in bivariate relationships: Getting ready; How to do it…; How it works…; There's more…; See also
  Using subsetting to examine logical inconsistencies in variable relationships: Getting ready; How to do it…; How it works…; See also
  Using linear regression to identify data points with significant influence: Getting ready; How to do it…; How it works…; There's more…
  Using k-nearest neighbor to find outliers: Getting ready; How to do it…; How it works…; There's more…; See also
  Using Isolation Forest to find anomalies: Getting ready; How to do it…; How it works…; There's more…; See also
Chapter 5: Using Visualizations for the Identification of Unexpected Values
  Technical requirements
  Using histograms to examine the distribution of continuous variables: Getting ready; How to do it…; How it works…; There's more…
  Using boxplots to identify outliers for continuous variables: Getting ready; How to do it…; How it works…; There's more…; See also
  Using grouped boxplots to uncover unexpected values in a particular group: Getting ready; How to do it…; How it works…; There's more…; See also
  Examining both the distribution shape and outliers with violin plots: Getting ready; How to do it…; How it works…; There's more…; See also
  Using scatter plots to view bivariate relationships: Getting ready; How to do it…; How it works…; There's more…; See also
  Using line plots to examine trends in continuous variables: Getting ready; How to do it…; How it works…; There's more…; See also
  Generating a heat map based on a correlation matrix: Getting ready; How to do it…; How it works…; There's more…; See also
Chapter 6: Cleaning and Exploring Data with Series Operations
  Technical requirements
  Getting values from a pandas series: Getting ready; How to do it…; How it works…
  Showing summary statistics for a pandas series: Getting ready; How to do it…; How it works…; There's more…; See also
  Changing series values: Getting ready; How to do it…; How it works…; There's more…; See also
  Changing series values conditionally: Getting ready; How to do it…; How it works…; There's more…; See also
  Evaluating and cleaning string series data: Getting ready; How to do it…; How it works…; There's more…
  Working with dates: Getting ready; How to do it…; How it works…; See also
  Identifying and cleaning missing data: Getting ready; How to do it…; How it works…; There's more…; See also
  Missing value imputation with K-nearest neighbor: Getting ready; How to do it…; How it works…; There's more…; See also
Chapter 7: Fixing Messy Data when Aggregating
  Technical requirements
  Looping through data with itertuples (an anti-pattern): Getting ready; How to do it…; How it works…; There's more…
  Calculating summaries by group with NumPy arrays: Getting ready; How to do it…; How it works…; There's more…; See also
  Using groupby to organize data by groups: Getting ready; How to do it…; How it works…; There's more…
  Using more complicated aggregation functions with groupby: Getting ready; How to do it…; How it works…; There's more…; See also
  Using user-defined functions and apply with groupby: Getting ready; How to do it…; How it works…; There's more…; See also
  Using groupby to change the unit of analysis of a DataFrame: Getting ready; How to do it…; How it works…
Chapter 8: Addressing Data Issues When Combining DataFrames
  Technical requirements
  Combining DataFrames vertically: Getting ready; How to do it…; How it works…; See also
  Doing one-to-one merges: Getting ready; How to do it…; How it works…; There's more…
  Using multiple merge-by columns: Getting ready; How to do it…; How it works…; There's more…
  Doing one-to-many merges: Getting ready; How to do it…; How it works…; There's more…; See also
  Doing many-to-many merges: Getting ready; How to do it…; How it works…; There's more…
  Developing a merge routine: Getting ready; How to do it…; How it works…; See also

Chapter 9: Tidying and Reshaping Data
  Technical requirements
  Removing duplicated rows: Getting ready…; How to do it…; How it works…; There's more…; See also…
  Fixing many-to-many relationships: Getting ready…; How to do it…; How it works…; There's more…; See also…
  Using stack and melt to reshape data from wide to long format: Getting ready…; How to do it…; How it works…
  Melting multiple groups of columns: Getting ready…; How to do it…; How it works…; There's more…
  Using unstack and pivot to reshape data from long to wide: Getting ready…; How to do it…; How it works…
Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning
  Technical requirements
  Functions for getting a first look at our data: Getting ready…; How to do it…; How it works…; There's more…
  Functions for displaying summary statistics and frequencies: Getting ready; How to do it…; How it works…; There's more…; See also…
  Functions for identifying outliers and unexpected values: Getting ready; How to do it…; How it works…; There's more…; See also
  Functions for aggregating or combining data: Getting ready; How to do it…; How it works…; There's more…; See also
  Classes that contain the logic for updating series values: Getting ready; How to do it…; How it works…; There's more…; See also
  Classes that handle non-tabular data structures: Getting ready; How to do it…; How it works…; There's more…
Other Books You May Enjoy
Leave a review - let other readers know what you think

Content preview from Python Data Cleaning Cookbook

Chapter 4: Identifying Missing Values and Outliers in Subsets of Data

Outliers and unexpected values may not be errors. They often are not. Individuals and events are complicated and surprise the analyst. Some people really are 7'4" tall and some really have $50 million salaries. Sometimes, data is messy because people and situations are messy; however, extreme values can have an outsized impact on our analysis, particularly when we are using parametric techniques that assume a normal distribution.

These issues may become even more apparent when working with subsets of data. That is not just because extreme or unexpected values have more weight in smaller samples. It is also because they may make less sense when bivariate and multivariate relationships ...
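
The point about extreme values carrying more weight in smaller samples can be illustrated with a minimal sketch. It is not taken from the book: the salary figures are invented, the helper name `shift_from_outlier` is hypothetical, and plain Python's `statistics` module is used here only to keep the example dependency-free, whereas the book itself works in pandas.

```python
from statistics import mean, median

# Hypothetical salaries in dollars, invented for illustration only.
typical = [40_000 + 1_000 * i for i in range(99)]  # 99 ordinary salaries
extreme = 50_000_000                               # one genuine but extreme salary


def shift_from_outlier(values, outlier):
    """Return how far one added value moves the mean and the median."""
    with_outlier = values + [outlier]
    mean_shift = mean(with_outlier) - mean(values)
    median_shift = median(with_outlier) - median(values)
    return mean_shift, median_shift


# Effect of the same extreme value on the full sample and on a 9-row subset.
full_mean_shift, full_median_shift = shift_from_outlier(typical, extreme)
subset_mean_shift, subset_median_shift = shift_from_outlier(typical[:9], extreme)

print("full sample:", full_mean_shift, full_median_shift)
print("small subset:", subset_mean_shift, subset_median_shift)
```

In this made-up data the single extreme salary moves the subset mean roughly ten times further than the full-sample mean, while the median barely changes in either case. That is one reason a value that looks tolerable in the full data may deserve a closer look once the data is subset.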
