Cleaning Data for Effective Data Science

Think about your data intelligently and ask the right questions

Key Features

  • Master data cleaning techniques necessary to perform real-world data science and machine learning tasks
  • Spot common problems with dirty data and develop flexible solutions from first principles
  • Test and refine your newly acquired skills through detailed exercises at the end of each chapter

Book Description

Data cleaning is the all-important first step to successful data science, data analysis, and machine learning. If you work with any kind of data, this book is your go-to resource, arming you with the insights and heuristics that experienced data scientists had to learn the hard way.

In a light-hearted and engaging exploration of different tools, techniques, and datasets, both real and fictitious, Python veteran David Mertz teaches you the ins and outs of data preparation and the essential questions you should ask of every piece of data you work with.

Using a mixture of Python, R, and common command-line tools, Cleaning Data for Effective Data Science follows the data cleaning pipeline from start to finish, focusing on helping you understand the principles underlying each step of the process. You'll learn to ingest data from a vast range of tabular, hierarchical, and other formats; impute missing values; detect unreliable data and statistical anomalies; and generate synthetic features. The long-form exercises at the end of each chapter let you get hands-on with the skills you've acquired along the way, and they also make the book a valuable resource for academic courses.

What you will learn

  • Ingest and work with common data formats and sources like JSON, CSV, SQL and NoSQL databases, PDF, and binary serialized data structures
  • Understand how and why we use tools such as pandas, SciPy, scikit-learn, Tidyverse, and Bash
  • Apply useful rules and heuristics for assessing data quality and detecting bias, like Benford's law and the 68-95-99.7 rule
  • Identify and handle unreliable data and outliers, examining z-score and other statistical properties (see the short sketch after this list)
  • Impute sensible values for missing data and use sampling to correct class imbalances
  • Use dimensionality reduction, quantization, one-hot encoding, and other feature engineering techniques to draw out patterns in your data
  • Work carefully with time series data, performing de-trending and interpolation
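
To give a taste of these techniques, here is a minimal sketch, not taken from the book itself, of z-score outlier detection in Python using the 68-95-99.7 rule. The data and the cutoff of 3 standard deviations are illustrative assumptions only:

    import numpy as np

    # Synthetic, illustrative data: 1,000 normally distributed readings
    # plus two injected anomalies.
    rng = np.random.default_rng(42)
    values = np.concatenate([rng.normal(loc=50, scale=5, size=1000),
                             [95.0, -10.0]])

    # Z-score: how many standard deviations each value lies from the mean.
    z_scores = (values - values.mean()) / values.std()

    # By the 68-95-99.7 rule, roughly 99.7% of normally distributed data
    # falls within 3 standard deviations, so |z| > 3 is a common cutoff.
    outliers = values[np.abs(z_scores) > 3]
    print(f"Flagged {len(outliers)} outliers: {outliers}")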

Who this book is for

This book is designed to benefit software developers, data scientists, aspiring data scientists, teachers, and students who work with data. If you want to improve the rigor of your data hygiene or are looking for a refresher, this book is for you.

Basic familiarity with statistics, general concepts in machine learning, knowledge of a programming language (Python or R), and some exposure to data science are helpful.

Table of contents

  1. Preface
  2. PART I: Data Ingestion
    1. Tabular Formats
      1. Tidying Up
      2. CSV
        1. Sanity Checks
        2. The Good, the Bad, and the Textual Data
          1. The Bad
          2. The Good
      3. Spreadsheets Considered Harmful
      4. SQL RDBMS
        1. Massaging Data Types
        2. Repeating in R
        3. Where SQL Goes Wrong (and How to Notice It)
      5. Other Formats
        1. HDF5 and NetCDF-4
          1. Tools and Libraries
        2. SQLite
        3. Apache Parquet
      6. Data Frames
        1. Spark/Scala
        2. Pandas and Derived Wrappers
        3. Vaex
        4. Data Frames in R (Tidyverse)
        5. Data Frames in R (data.table)
        6. Bash for Fun
      7. Exercises
        1. Tidy Data from Excel
        2. Tidy Data from SQL
      8. Denouement
    2. Hierarchical Formats
      1. JSON
        1. What JSON Looks Like
        2. NaN Handling and Data Types
        3. JSON Lines
        4. GeoJSON
        5. Tidy Geography
        6. JSON Schema
      2. XML
        1. User Records
        2. Keyhole Markup Language
      3. Configuration Files
        1. INI and Flat Custom Formats
        2. TOML
        3. Yet Another Markup Language
      4. NoSQL Databases
        1. Document-Oriented Databases
          1. Missing Fields
          2. Denormalization and Its Discontents
        2. Key/Value Stores
      5. Exercises
        1. Exploring Filled Area
        2. Create a Relational Model
      6. Denouement
    3. Repurposing Data Sources
      1. Web Scraping
        1. HTML Tables
        2. Non-Tabular Data
        3. Command-Line Scraping
      2. Portable Document Format
      3. Image Formats
        1. Pixel Statistics
        2. Channel Manipulation
        3. Metadata
      4. Binary Serialized Data Structures
      5. Custom Text Formats
        1. A Structured Log
        2. Character Encodings
      6. Exercises
        1. Enhancing the NPY Parser
        2. Scraping Web Traffic
      7. Denouement
  3. PART II: The Vicissitudes of Error
    1. Anomaly Detection
      1. Missing Data
        1. SQL
        2. Hierarchical Formats
        3. Sentinels
      2. Miscoded Data
      3. Fixed Bounds
      4. Outliers
        1. Z-Score
        2. Interquartile Range
      5. Multivariate Outliers
      6. Exercises
        1. A Famous Experiment
        2. Misspelled Words
      7. Denouement
    2. Data Quality
      1. Missing Data
      2. Biasing Trends
        1. Understanding Bias
        2. Detecting Bias
        3. Comparison to Baselines
        4. Benford’s Law
      3. Class Imbalance
      4. Normalization and Scaling
        1. Applying a Machine Learning Model
        2. Scaling Techniques
        3. Factor and Sample Weighting
      5. Cyclicity and Autocorrelation
        1. Domain Knowledge Trends
        2. Discovered Cycles
      6. Bespoke Validation
        1. Collation Validation
        2. Transcription Validation
      7. Exercises
        1. Data Characterization
        2. Oversampled Polls
      8. Denouement
  4. PART III: Rectification and Creation
    1. Value Imputation
      1. Typical-Value Imputation
        1. Typical Tabular Data
        2. Locality Imputation
      2. Trend Imputation
        1. Types of Trends
        2. A Larger Coarse Time Series
          1. Understanding the Data
          2. Removing Unusable Data
          3. Imputing Consistency
          4. Interpolation
        3. Non-Temporal Trends
      3. Sampling
        1. Undersampling
        2. Oversampling
      4. Exercises
        1. Alternate Trend Imputation
        2. Balancing Multiple Features
      5. Denouement
    2. Feature Engineering
      1. Date/Time Fields
        1. Creating Datetimes
        2. Imposing Regularity
        3. Duplicated Timestamps
        4. Adding Timestamps
      2. String Fields
        1. Fuzzy Matching
        2. Explicit Categories
      3. String Vectors
        1. Decompositions
        2. Rotation and Whitening
        3. Dimensionality Reduction
        4. Visualization
      4. Quantization and Binarization
      5. One-Hot Encoding
      6. Polynomial Features
        1. Generating Synthetic Features
        2. Feature Selection
      7. Exercises
        1. Intermittent Occurrences
        2. Characterizing Levels
      8. Denouement
  5. PART IV: Ancillary Matters
    1. Closure
      1. What You Know
      2. What You Don’t Know (Yet)
    2. Glossary
  6. Why subscribe?
  7. Other Books You May Enjoy
  8. Index

Product information

  • Title: Cleaning Data for Effective Data Science
  • Author(s): David Mertz
  • Release date: March 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781801071291