Practical Python Data Wrangling and Data Quality

Name: Practical Python Data Wrangling and Data Quality
Author: Susan E. McGregor
ISBN: 9781492091509

December 2021

Beginner to intermediate

413 pages

11h 55m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book? · Who Shouldn’t Read This Book? · What to Expect from This Volume · Conventions Used in This Book · Using Code Examples · O’Reilly Online Learning · How to Contact Us · Acknowledgments
1. Introduction to Data Wrangling and Data Quality
What Is “Data Wrangling”? · What Is “Data Quality”? · Data Integrity · Data “Fit” · Why Python? · Versatility · Accessibility · Readability · Community · Python Alternatives · Writing and “Running” Python · Working with Python on Your Own Device · Getting Started with the Command Line · Installing Python, Jupyter Notebook, and a Code Editor · Working with Python Online · Hello World! · Using Atom to Create a Standalone Python File · Using Jupyter to Create a New Python Notebook · Using Google Colab to Create a New Python Notebook · Adding the Code · In a Standalone File · In a Notebook · Running the Code · In a Standalone File · In a Notebook · Documenting, Saving, and Versioning Your Work · Documenting · Saving · Versioning · Conclusion
2. Introduction to Python
The Programming “Parts of Speech” · Nouns ≈ Variables · Verbs ≈ Functions · Cooking with Custom Functions · Libraries: Borrowing Custom Functions from Other Coders · Taking Control: Loops and Conditionals · In the Loop · One Condition… · Understanding Errors · Syntax Snafus · Runtime Runaround · Logic Loss · Hitting the Road with Citi Bike Data · Starting with Pseudocode · Seeking Scale · Conclusion
3. Understanding Data Quality
Assessing Data Fit · Validity · Reliability · Representativeness · Assessing Data Integrity · Necessary, but Not Sufficient · Important · Achievable · Improving Data Quality · Data Cleaning · Data Augmentation · Conclusion
4. Working with File-Based and Feed-Based Data in Python
Structured Versus Unstructured Data · Working with Structured Data · File-Based, Table-Type Data—Take It to Delimit · Wrangling Table-Type Data with Python · Real-World Data Wrangling: Understanding Unemployment · XLSX, ODS, and All the Rest · Finally, Fixed-Width · Feed-Based Data—Web-Driven Live Updates · Wrangling Feed-Type Data with Python · Working with Unstructured Data · Image-Based Text: Accessing Data in PDFs · Wrangling PDFs with Python · Accessing PDF Tables with Tabula · Conclusion
5. Accessing Web-Based Data
Accessing Online XML and JSON · Introducing APIs · Basic APIs: A Search Engine Example · Specialized APIs: Adding Basic Authentication · Getting a FRED API Key · Using Your API key to Request Data · Reading API Documentation · Protecting Your API Key When Using Python · Creating Your “Credentials” File · Using Your Credentials in a Separate Script · Getting Started with .gitignore · Specialized APIs: Working With OAuth · Applying for a Twitter Developer Account · Creating Your Twitter “App” and Credentials · Encoding Your API Key and Secret · Requesting an Access Token and Data from the Twitter API · API Ethics · Web Scraping: The Data Source of Last Resort · Carefully Scraping the MTA · Using Browser Inspection Tools · The Python Web Scraping Solution: Beautiful Soup · Conclusion
6. Assessing Data Quality
The Pandemic and the PPP · Assessing Data Integrity · Is It of Known Pedigree? · Is It Timely? · Is It Complete? · Is It Well-Annotated? · Is It High Volume? · Is It Consistent? · Is It Multivariate? · Is It Atomic? · Is It Clear? · Is It Dimensionally Structured? · Assessing Data Fit · Validity · Reliability · Representativeness · Conclusion
7. Cleaning, Transforming, and Augmenting Data
Selecting a Subset of Citi Bike Data · A Simple Split · Regular Expressions: Supercharged String Matching · Making a Date · De-crufting Data Files · Decrypting Excel Dates · Generating True CSVs from Fixed-Width Data · Correcting for Spelling Inconsistencies · The Circuitous Path to “Simple” Solutions · Gotchas That Will Get Ya! · Augmenting Your Data · Conclusion
8. Structuring and Refactoring Your Code
Revisiting Custom Functions · Will You Use It More Than Once? · Is It Ugly and Confusing? · Do You Just Really Hate the Default Functionality? · Understanding Scope · Defining the Parameters for Function “Ingredients” · What Are Your Options? · Getting Into Arguments? · Return Values · Climbing the “Stack” · Refactoring for Fun and Profit · A Function for Identifying Weekdays · Metadata Without the Mess · Documenting Your Custom Scripts and Functions with pydoc · The Case for Command-Line Arguments · Where Scripts and Notebooks Diverge · Conclusion
9. Introduction to Data Analysis
Context Is Everything · Same but Different · What’s Typical? Evaluating Central Tendency · What’s That Mean? · Embrace the Median · Think Different: Identifying Outliers · Visualization for Data Analysis · What’s Our Data’s Shape? Understanding Histograms · The Significance of Symmetry · Counting “Clusters” · The $2 Million Question · Proportional Response · Conclusion
10. Presenting Your Data
Foundations for Visual Eloquence · Making Your Data Statement · Charts, Graphs, and Maps: Oh My! · Pie Charts · Bar and Column Charts · Line Charts · Scatter Charts · Maps · Elements of Eloquent Visuals · The “Finicky” Details Really Do Make a Difference · Trust Your Eyes (and the Experts) · Selecting Scales · Choosing Colors · Above All, Annotate! · From Basic to Beautiful: Customizing a Visualization with seaborn and matplotlib · Beyond the Basics · Conclusion
11. Beyond Python
Additional Tools for Data Review · Spreadsheet Programs · OpenRefine · Additional Tools for Sharing and Presenting Data · Image Editing for JPGs, PNGs, and GIFs · Software for Editing SVGs and Other Vector Formats · Reflecting on Ethics · Conclusion
A. More Python Programming Resources
Official Python Documentation · Installing Python Resources · Where to Look for Libraries · Keeping Your Tools Sharp · Where to Learn More
B. A Bit More About Git
You Run git push/pull and End Up in a Weird Text Editor · Your git push/pull Command Gets Rejected · Run git pull · Git Quick Reference
C. Finding Data
Data Repositories and APIs · Subject Matter Experts · FOIA/L Requests · Custom Data Collection
D. Resources for Visualization and Information Design
Foundational Books on Information Visualization · The Quick Reference You’ll Reach For · Sources of Inspiration
Index
About the Author

Content preview from Practical Python Data Wrangling and Data Quality

Chapter 3. Understanding Data Quality

Data is everywhere. It’s automatically generated by our mobile devices, our shopping activities, and our physical movements. It’s captured by our electric meters, public transportation systems, and communications infrastructure. And it’s used to estimate our health outcomes, our earning potential, and our credit worthiness.¹ Economists have even declared that data is the “new oil,”² given its potential to transform so many aspects of human life.

While data may be plentiful, however, the truth is that good data is scarce. The claim of “the data revolution” is that, with enough data, we can better understand the present and improve, or even predict, the future. For any of that to even be possible, however, the data underlying those insights has to be high quality. Without good-quality data, all of our efforts to wrangle, analyze, visualize, and communicate it will, at best, leave us with no more insight about the world than when we started. While that would be an unfortunate waste of effort, the consequences of failing to recognize that we have poor-quality data are even worse, because that failure can lead us to develop a seemingly rational but dangerously distorted view of reality. What’s more, because data-driven systems are used to make decisions at scale, the harms caused by even a small amount of bad data can be significant. Sure, data about hundreds or even thousands of people may be used to “train” a machine learning model. But if that data is not ...

Publisher Resources

ISBN: 9781492091493
