
Cleaning Data for Effective Data Science

Name: Cleaning Data for Effective Data Science
Author: David Mertz
ISBN: 9781801071291


March 2021

Beginner to intermediate

498 pages

12h 3m

English

Packt Publishing


Preface
PART I: Data Ingestion
Tabular Formats
Tidying Up; CSV; Sanity Checks; The Good, the Bad, and the Textual Data; The Bad; The Good; Spreadsheets Considered Harmful; SQL RDBMS; Massaging Data Types; Repeating in R; Where SQL Goes Wrong (and How to Notice It); Other Formats; HDF5 and NetCDF-4; Tools and Libraries; SQLite; Apache Parquet; Data Frames; Spark/Scala; Pandas and Derived Wrappers; Vaex; Data Frames in R (Tidyverse); Data Frames in R (data.table); Bash for Fun; Exercises; Tidy Data from Excel; Tidy Data from SQL; Denouement
Hierarchical Formats
JSON; What JSON Looks Like; NaN Handling and Data Types; JSON Lines; GeoJSON; Tidy Geography; JSON Schema; XML; User Records; Keyhole Markup Language; Configuration Files; INI and Flat Custom Formats; TOML; Yet Another Markup Language; NoSQL Databases; Document-Oriented Databases; Missing Fields; Denormalization and Its Discontents; Key/Value Stores; Exercises; Exploring Filled Area; Create a Relational Model; Denouement
Repurposing Data Sources
Web Scraping; HTML Tables; Non-Tabular Data; Command-Line Scraping; Portable Document Format; Image Formats; Pixel Statistics; Channel Manipulation; Metadata; Binary Serialized Data Structures; Custom Text Formats; A Structured Log; Character Encodings; Exercises; Enhancing the NPY Parser; Scraping Web Traffic; Denouement
PART II: The Vicissitudes of Error
Anomaly Detection
Missing Data; SQL; Hierarchical Formats; Sentinels; Miscoded Data; Fixed Bounds; Outliers; Z-Score; Interquartile Range; Multivariate Outliers; Exercises; A Famous Experiment; Misspelled Words; Denouement
Data Quality
Missing Data; Biasing Trends; Understanding Bias; Detecting Bias; Comparison to Baselines; Benford’s Law; Class Imbalance; Normalization and Scaling; Applying a Machine Learning Model; Scaling Techniques; Factor and Sample Weighting; Cyclicity and Autocorrelation; Domain Knowledge Trends; Discovered Cycles; Bespoke Validation; Collation Validation; Transcription Validation; Exercises; Data Characterization; Oversampled Polls; Denouement
PART III: Rectification and Creation
Value Imputation
Typical-Value Imputation; Typical Tabular Data; Locality Imputation; Trend Imputation; Types of Trends; A Larger Coarse Time Series; Understanding the Data; Removing Unusable Data; Imputing Consistency; Interpolation; Non-Temporal Trends; Sampling; Undersampling; Oversampling; Exercises; Alternate Trend Imputation; Balancing Multiple Features; Denouement

Feature Engineering
Date/Time Fields; Creating Datetimes; Imposing Regularity; Duplicated Timestamps; Adding Timestamps; String Fields; Fuzzy Matching; Explicit Categories; String Vectors; Decompositions; Rotation and Whitening; Dimensionality Reduction; Visualization; Quantization and Binarization; One-Hot Encoding; Polynomial Features; Generating Synthetic Features; Feature Selection; Exercises; Intermittent Occurrences; Characterizing Levels; Denouement
PART IV: Ancillary Matters
Closure
What You Know; What You Don’t Know (Yet)
Glossary
Why subscribe?
Other Books You May Enjoy
Index

Content preview from Cleaning Data for Effective Data Science

7 Feature Engineering

People come to me as a data scientist with their data. Then my job becomes part data-hazmat officer, part grief counselor.

–Anonymous

Chapter 6, Value Imputation, looked at filling in missing values. In Chapter 5, Data Quality, we touched on normalization and scaling, which adjust values to artificially fit certain numeric or categorical patterns. Both of those earlier topics come close to the subject of this chapter, but here we focus more directly on the creation of synthetic features based on raw datasets. Whereas imputation is a matter of making reasonable guesses about what missing values might be, feature engineering is about changing the representational form of data, but in ways that are deterministic and often ...



