Python Data Cleaning Cookbook - Second Edition

Name: Python Data Cleaning Cookbook - Second Edition
Author: Michael Walker
ISBN: 9781803239873

May 2024

Intermediate to advanced

486 pages

11h 33m

English

Packt Publishing

Preface
    New in the Second Edition
    Who this book is for
    What this book covers
    Get in touch

Anticipating Data Cleaning Issues When Importing Tabular Data with pandas
    Technical requirements
    Importing CSV files
    Importing Excel files
    Importing data from SQL databases
    Importing SPSS, Stata, and SAS data
    Importing R data
    Persisting tabular data
    Summary

Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data
    Technical requirements
    Importing simple JSON data
    Importing more complicated JSON data from an API
    Importing data from web pages
    Working with Spark data
    Persisting JSON data
    Versioning data
    Summary

Taking the Measure of Your Data
    Technical requirements
    Getting a first look at your data
    Selecting and organizing columns
    Selecting rows
    Generating frequencies for categorical variables
    Generating summary statistics for continuous variables
    Using generative AI to display descriptive statistics
    Summary

Identifying Outliers in Subsets of Data
    Technical requirements
    Identifying outliers with one variable
    Identifying outliers and unexpected values in bivariate relationships
    Using subsetting to examine logical inconsistencies in variable relationships
    Using linear regression to identify data points with significant influence
    Using k-nearest neighbors to find outliers
    Using Isolation Forest to find anomalies
    Using PandasAI to identify outliers
    Summary

Using Visualizations for the Identification of Unexpected Values
    Technical requirements
    Using histograms to examine the distribution of continuous variables
    Using boxplots to identify outliers for continuous variables
    Using grouped boxplots to uncover unexpected values in a particular group
    Examining both distribution shape and outliers with violin plots
    Using scatter plots to view bivariate relationships
    Using line plots to examine trends in continuous variables
    Generating a heat map based on a correlation matrix
    Summary

Cleaning and Exploring Data with Series Operations
    Technical requirements
    Getting values from a pandas Series
    Showing summary statistics for a pandas Series
    Changing Series values
    Changing Series values conditionally
    Evaluating and cleaning string Series data
    Working with dates
    Using OpenAI for Series operations
    Summary

Identifying and Fixing Missing Values
    Technical requirements
    Identifying missing values
    Cleaning missing values
    Imputing values with regression
    Using k-nearest neighbors for imputation
    Using random forest for imputation
    Using PandasAI for imputation
    Summary

Encoding, Transforming, and Scaling Features
    Technical requirements
    Creating training datasets and avoiding data leakage
    Removing redundant or unhelpful features
    Encoding categorical features: one-hot encoding
    Encoding categorical features: ordinal encoding
    Encoding categorical features with medium or high cardinality
    Using mathematical transformations
    Feature binning: equal width and equal frequency
    k-means binning
    Feature scaling
    Summary

Fixing Messy Data When Aggregating
    Technical requirements
    Looping through data with itertuples (an anti-pattern)
    Calculating summaries by group with NumPy arrays
    Using groupby to organize data by groups
    Using more complicated aggregation functions with groupby
    Using user-defined functions and apply with groupby
    Using groupby to change the unit of analysis of a DataFrame
    Using pivot_table to change the unit of analysis of a DataFrame
    Summary

Addressing Data Issues When Combining DataFrames
    Technical requirements
    Combining DataFrames vertically
    Doing one-to-one merges
    Doing one-to-one merges by multiple columns
    Doing one-to-many merges
    Doing many-to-many merges
    Developing a merge routine
    Summary

Tidying and Reshaping Data
    Technical requirements
    Removing duplicated rows
    Fixing many-to-many relationships
    Using stack and melt to reshape data from wide to long format
    Melting multiple groups of columns
    Using unstack and pivot to reshape data from long to wide format
    Summary

Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines
    Technical requirements
    Functions for getting a first look at our data
    Functions for displaying summary statistics and frequencies
    Functions for identifying outliers and unexpected values
    Functions for aggregating or combining data
    Classes that contain the logic for updating Series values
    Classes that handle non-tabular data structures
    Functions for checking overall data quality
    Pre-processing data with pipelines: a simple example
    Pre-processing data with pipelines: a more complicated example
    Summary

Index

Content preview from Python Data Cleaning Cookbook - Second Edition

2 Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data

This chapter continues our work on importing data from a variety of sources and on the initial checks we should run on the data after importing it. Over the last 25 years, data analysts have found that they increasingly need to work with data in non-tabular, semi-structured forms. Sometimes, they even create and persist data in those forms. In this chapter, we will work with JSON, a common alternative to traditional tabular datasets, but the general concepts can be extended to XML and to NoSQL data stores such as MongoDB. We will also go over common issues that occur when scraping data from websites.

Data analysts have also been finding that increases in the volume of data ...
