Practical Python Data Wrangling and Data Quality

Name: Practical Python Data Wrangling and Data Quality
Author: Susan E. McGregor
ISBN: 9781492091509

December 2021

Beginner to intermediate

413 pages

11h 55m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book?; Who Shouldn’t Read This Book?; What to Expect from This Volume; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction to Data Wrangling and Data Quality
What Is “Data Wrangling”?; What Is “Data Quality”?; Data Integrity; Data “Fit”; Why Python?; Versatility; Accessibility; Readability; Community; Python Alternatives; Writing and “Running” Python; Working with Python on Your Own Device; Getting Started with the Command Line; Installing Python, Jupyter Notebook, and a Code Editor; Working with Python Online; Hello World!; Using Atom to Create a Standalone Python File; Using Jupyter to Create a New Python Notebook; Using Google Colab to Create a New Python Notebook; Adding the Code; In a Standalone File; In a Notebook; Running the Code; In a Standalone File; In a Notebook; Documenting, Saving, and Versioning Your Work; Documenting; Saving; Versioning; Conclusion
2. Introduction to Python
The Programming “Parts of Speech”; Nouns ≈ Variables; Verbs ≈ Functions; Cooking with Custom Functions; Libraries: Borrowing Custom Functions from Other Coders; Taking Control: Loops and Conditionals; In the Loop; One Condition…; Understanding Errors; Syntax Snafus; Runtime Runaround; Logic Loss; Hitting the Road with Citi Bike Data; Starting with Pseudocode; Seeking Scale; Conclusion
3. Understanding Data Quality
Assessing Data Fit; Validity; Reliability; Representativeness; Assessing Data Integrity; Necessary, but Not Sufficient; Important; Achievable; Improving Data Quality; Data Cleaning; Data Augmentation; Conclusion
4. Working with File-Based and Feed-Based Data in Python
Structured Versus Unstructured Data; Working with Structured Data; File-Based, Table-Type Data—Take It to Delimit; Wrangling Table-Type Data with Python; Real-World Data Wrangling: Understanding Unemployment; XLSX, ODS, and All the Rest; Finally, Fixed-Width; Feed-Based Data—Web-Driven Live Updates; Wrangling Feed-Type Data with Python; Working with Unstructured Data; Image-Based Text: Accessing Data in PDFs; Wrangling PDFs with Python; Accessing PDF Tables with Tabula; Conclusion
5. Accessing Web-Based Data
Accessing Online XML and JSON; Introducing APIs; Basic APIs: A Search Engine Example; Specialized APIs: Adding Basic Authentication; Getting a FRED API Key; Using Your API key to Request Data; Reading API Documentation; Protecting Your API Key When Using Python; Creating Your “Credentials” File; Using Your Credentials in a Separate Script; Getting Started with .gitignore; Specialized APIs: Working With OAuth; Applying for a Twitter Developer Account; Creating Your Twitter “App” and Credentials; Encoding Your API Key and Secret; Requesting an Access Token and Data from the Twitter API; API Ethics; Web Scraping: The Data Source of Last Resort; Carefully Scraping the MTA; Using Browser Inspection Tools; The Python Web Scraping Solution: Beautiful Soup; Conclusion
6. Assessing Data Quality
The Pandemic and the PPP; Assessing Data Integrity; Is It of Known Pedigree?; Is It Timely?; Is It Complete?; Is It Well-Annotated?; Is It High Volume?; Is It Consistent?; Is It Multivariate?; Is It Atomic?; Is It Clear?; Is It Dimensionally Structured?; Assessing Data Fit; Validity; Reliability; Representativeness; Conclusion
7. Cleaning, Transforming, and Augmenting Data
Selecting a Subset of Citi Bike Data; A Simple Split; Regular Expressions: Supercharged String Matching; Making a Date; De-crufting Data Files; Decrypting Excel Dates; Generating True CSVs from Fixed-Width Data; Correcting for Spelling Inconsistencies; The Circuitous Path to “Simple” Solutions; Gotchas That Will Get Ya!; Augmenting Your Data; Conclusion
8. Structuring and Refactoring Your Code
Revisiting Custom Functions; Will You Use It More Than Once?; Is It Ugly and Confusing?; Do You Just Really Hate the Default Functionality?; Understanding Scope; Defining the Parameters for Function “Ingredients”; What Are Your Options?; Getting Into Arguments?; Return Values; Climbing the “Stack”; Refactoring for Fun and Profit; A Function for Identifying Weekdays; Metadata Without the Mess; Documenting Your Custom Scripts and Functions with pydoc; The Case for Command-Line Arguments; Where Scripts and Notebooks Diverge; Conclusion
9. Introduction to Data Analysis
Context Is Everything; Same but Different; What’s Typical? Evaluating Central Tendency; What’s That Mean?; Embrace the Median; Think Different: Identifying Outliers; Visualization for Data Analysis; What’s Our Data’s Shape? Understanding Histograms; The Significance of Symmetry; Counting “Clusters”; The $2 Million Question; Proportional Response; Conclusion
10. Presenting Your Data
Foundations for Visual Eloquence; Making Your Data Statement; Charts, Graphs, and Maps: Oh My!; Pie Charts; Bar and Column Charts; Line Charts; Scatter Charts; Maps; Elements of Eloquent Visuals; The “Finicky” Details Really Do Make a Difference; Trust Your Eyes (and the Experts); Selecting Scales; Choosing Colors; Above All, Annotate!; From Basic to Beautiful: Customizing a Visualization with seaborn and matplotlib; Beyond the Basics; Conclusion
11. Beyond Python
Additional Tools for Data Review; Spreadsheet Programs; OpenRefine; Additional Tools for Sharing and Presenting Data; Image Editing for JPGs, PNGs, and GIFs; Software for Editing SVGs and Other Vector Formats; Reflecting on Ethics; Conclusion
A. More Python Programming Resources
Official Python Documentation; Installing Python Resources; Where to Look for Libraries; Keeping Your Tools Sharp; Where to Learn More
B. A Bit More About Git
You Run git push/pull and End Up in a Weird Text Editor; Your git push/pull Command Gets Rejected; Run git pull; Git Quick Reference
C. Finding Data
Data Repositories and APIs; Subject Matter Experts; FOIA/L Requests; Custom Data Collection
D. Resources for Visualization and Information Design
Foundational Books on Information Visualization; The Quick Reference You’ll Reach For; Sources of Inspiration
Index
About the Author

Content preview from Practical Python Data Wrangling and Data Quality

Chapter 6. Assessing Data Quality

Over the past two chapters, we’ve focused our efforts on identifying and accessing different formats of data in different locations, from spreadsheets to websites. But getting our hands on (potentially) interesting data is really only the beginning. The next step is conducting a thorough quality assessment to understand whether what we have is useful, salvageable, or just straight-up garbage.

As you may have gleaned from reading Chapter 3, crafting quality data is a complex and time-consuming business. The process is roughly equal parts research, experimentation, and dogged perseverance. Most importantly, committing to data quality means that you have to be willing to invest significant amounts of time and energy—and still be willing to throw it all out and start over if, despite your best efforts, the data you have just can’t be brought up to par.

When it comes down to it, in fact, that last criterion is probably what makes doing really high-quality, meaningful work with data truly difficult. The technical skills, as I hope you are already discovering, take some effort to master but are still highly achievable with sufficient practice. Research skills are a bit harder to document and convey, but working through the examples in this book will help you develop many of them, especially those related to the information discovery and collation needed for assessing and improving data quality.

When it comes to reconciling yourself to the fact that, after ...

Publisher Resources

ISBN: 9781492091493
