
Practical Python Data Wrangling and Data Quality

Name: Practical Python Data Wrangling and Data Quality
Author: Susan E. McGregor
ISBN: 9781492091509

by Susan E. McGregor

December 2021

Beginner to intermediate

413 pages

11h 55m

English

O'Reilly Media, Inc.


Preface
  Who Should Read This Book?
  Who Shouldn’t Read This Book?
  What to Expect from This Volume
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Introduction to Data Wrangling and Data Quality
  What Is “Data Wrangling”?
  What Is “Data Quality”?
  Data Integrity
  Data “Fit”
  Why Python?
  Versatility
  Accessibility
  Readability
  Community
  Python Alternatives
  Writing and “Running” Python
  Working with Python on Your Own Device
  Getting Started with the Command Line
  Installing Python, Jupyter Notebook, and a Code Editor
  Working with Python Online
  Hello World!
  Using Atom to Create a Standalone Python File
  Using Jupyter to Create a New Python Notebook
  Using Google Colab to Create a New Python Notebook
  Adding the Code
  In a Standalone File
  In a Notebook
  Running the Code
  In a Standalone File
  In a Notebook
  Documenting, Saving, and Versioning Your Work
  Documenting
  Saving
  Versioning
  Conclusion
2. Introduction to Python
  The Programming “Parts of Speech”
  Nouns ≈ Variables
  Verbs ≈ Functions
  Cooking with Custom Functions
  Libraries: Borrowing Custom Functions from Other Coders
  Taking Control: Loops and Conditionals
  In the Loop
  One Condition…
  Understanding Errors
  Syntax Snafus
  Runtime Runaround
  Logic Loss
  Hitting the Road with Citi Bike Data
  Starting with Pseudocode
  Seeking Scale
  Conclusion
3. Understanding Data Quality
  Assessing Data Fit
  Validity
  Reliability
  Representativeness
  Assessing Data Integrity
  Necessary, but Not Sufficient
  Important
  Achievable
  Improving Data Quality
  Data Cleaning
  Data Augmentation
  Conclusion
4. Working with File-Based and Feed-Based Data in Python
  Structured Versus Unstructured Data
  Working with Structured Data
  File-Based, Table-Type Data—Take It to Delimit
  Wrangling Table-Type Data with Python
  Real-World Data Wrangling: Understanding Unemployment
  XLSX, ODS, and All the Rest
  Finally, Fixed-Width
  Feed-Based Data—Web-Driven Live Updates
  Wrangling Feed-Type Data with Python
  Working with Unstructured Data
  Image-Based Text: Accessing Data in PDFs
  Wrangling PDFs with Python
  Accessing PDF Tables with Tabula
  Conclusion
5. Accessing Web-Based Data
  Accessing Online XML and JSON
  Introducing APIs
  Basic APIs: A Search Engine Example
  Specialized APIs: Adding Basic Authentication
  Getting a FRED API Key
  Using Your API key to Request Data
  Reading API Documentation
  Protecting Your API Key When Using Python
  Creating Your “Credentials” File
  Using Your Credentials in a Separate Script
  Getting Started with .gitignore
  Specialized APIs: Working With OAuth
  Applying for a Twitter Developer Account
  Creating Your Twitter “App” and Credentials
  Encoding Your API Key and Secret
  Requesting an Access Token and Data from the Twitter API
  API Ethics
  Web Scraping: The Data Source of Last Resort
  Carefully Scraping the MTA
  Using Browser Inspection Tools
  The Python Web Scraping Solution: Beautiful Soup
  Conclusion
6. Assessing Data Quality
  The Pandemic and the PPP
  Assessing Data Integrity
  Is It of Known Pedigree?
  Is It Timely?
  Is It Complete?
  Is It Well-Annotated?
  Is It High Volume?
  Is It Consistent?
  Is It Multivariate?
  Is It Atomic?
  Is It Clear?
  Is It Dimensionally Structured?
  Assessing Data Fit
  Validity
  Reliability
  Representativeness
  Conclusion
7. Cleaning, Transforming, and Augmenting Data
  Selecting a Subset of Citi Bike Data
  A Simple Split
  Regular Expressions: Supercharged String Matching
  Making a Date
  De-crufting Data Files
  Decrypting Excel Dates
  Generating True CSVs from Fixed-Width Data
  Correcting for Spelling Inconsistencies
  The Circuitous Path to “Simple” Solutions
  Gotchas That Will Get Ya!
  Augmenting Your Data
  Conclusion
8. Structuring and Refactoring Your Code
  Revisiting Custom Functions
  Will You Use It More Than Once?
  Is It Ugly and Confusing?
  Do You Just Really Hate the Default Functionality?
  Understanding Scope
  Defining the Parameters for Function “Ingredients”
  What Are Your Options?
  Getting Into Arguments?
  Return Values
  Climbing the “Stack”
  Refactoring for Fun and Profit
  A Function for Identifying Weekdays
  Metadata Without the Mess
  Documenting Your Custom Scripts and Functions with pydoc
  The Case for Command-Line Arguments
  Where Scripts and Notebooks Diverge
  Conclusion
9. Introduction to Data Analysis
  Context Is Everything
  Same but Different
  What’s Typical? Evaluating Central Tendency
  What’s That Mean?
  Embrace the Median
  Think Different: Identifying Outliers
  Visualization for Data Analysis
  What’s Our Data’s Shape? Understanding Histograms
  The Significance of Symmetry
  Counting “Clusters”
  The $2 Million Question
  Proportional Response
  Conclusion

10. Presenting Your Data
  Foundations for Visual Eloquence
  Making Your Data Statement
  Charts, Graphs, and Maps: Oh My!
  Pie Charts
  Bar and Column Charts
  Line Charts
  Scatter Charts
  Maps
  Elements of Eloquent Visuals
  The “Finicky” Details Really Do Make a Difference
  Trust Your Eyes (and the Experts)
  Selecting Scales
  Choosing Colors
  Above All, Annotate!
  From Basic to Beautiful: Customizing a Visualization with seaborn and matplotlib
  Beyond the Basics
  Conclusion
11. Beyond Python
  Additional Tools for Data Review
  Spreadsheet Programs
  OpenRefine
  Additional Tools for Sharing and Presenting Data
  Image Editing for JPGs, PNGs, and GIFs
  Software for Editing SVGs and Other Vector Formats
  Reflecting on Ethics
  Conclusion
A. More Python Programming Resources
  Official Python Documentation
  Installing Python Resources
  Where to Look for Libraries
  Keeping Your Tools Sharp
  Where to Learn More
B. A Bit More About Git
  You Run git push/pull and End Up in a Weird Text Editor
  Your git push/pull Command Gets Rejected
  Run git pull
  Git Quick Reference
C. Finding Data
  Data Repositories and APIs
  Subject Matter Experts
  FOIA/L Requests
  Custom Data Collection
D. Resources for Visualization and Information Design
  Foundational Books on Information Visualization
  The Quick Reference You’ll Reach For
  Sources of Inspiration
Index
About the Author

Content preview from Practical Python Data Wrangling and Data Quality

Chapter 5. Accessing Web-Based Data

The internet is an incredible source of data; it is, arguably, the reason that data has become such a dominant part of our social, economic, political, and even creative lives. In Chapter 4, we focused our data wrangling efforts on accessing and reformatting file-based data that had already been saved to our devices or to the cloud. At the same time, much of that data came from the internet originally, whether it was downloaded from a website, like the unemployment data, or retrieved from a URL, like the Citi Bike data. Now that we have a handle on how to use Python to parse and transform a variety of file-based data formats, it’s time to look at what’s involved in collecting those files in the first place, especially when the data they contain is of the real-time, feed-based variety. To do this, we’re going to spend the bulk of this chapter learning how to get ahold of data made available through APIs, the application programming interfaces I mentioned early in Chapter 4. APIs are the primary (and sometimes only) way that we can access the data generated by real-time or on-demand services like social media platforms, streaming music, and search services, as well as many other private and public (e.g., government-generated) data sources.

While the many benefits of APIs (see “Why APIs?” for a refresher) make them a popular resource for data-collecting companies to offer, there are significant costs and risks to doing so. For advertising-driven ...


Publisher Resources

ISBN: 9781492091493
