Practical Python Data Wrangling and Data Quality

Book description

The world around us is full of data that holds unique insights and valuable stories, and this book will help you uncover them. Whether you already work with data or want to learn more about its possibilities, the examples and techniques in this practical book will help you more easily clean, evaluate, and analyze data so that you can generate meaningful insights and compelling visualizations.

Complementing foundational concepts with expert advice, author Susan E. McGregor provides the resources you need to extract, evaluate, and analyze a wide variety of data sources and formats, along with the tools to communicate your findings effectively. This book delivers a methodical, jargon-free way for data practitioners at any level, from true novices to seasoned professionals, to harness the power of data.

  • Use Python 3.8+ to read, write, and transform data from a variety of sources
  • Understand and use programming basics in Python to wrangle data at scale
  • Organize, document, and structure your code using best practices
  • Collect data from structured data files, web pages, and APIs
  • Perform basic statistical analyses to make meaning from datasets
  • Visualize and present data in clear and compelling ways

Table of contents

  1. Preface
    1. Who Should Read This Book?
    2. Who Shouldn’t Read This Book?
    3. What to Expect from This Volume
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. 1. Introduction to Data Wrangling and Data Quality
    1. What Is “Data Wrangling”?
    2. What Is “Data Quality”?
      1. Data Integrity
      2. Data “Fit”
    3. Why Python?
      1. Versatility
      2. Accessibility
      3. Readability
      4. Community
      5. Python Alternatives
    4. Writing and “Running” Python
    5. Working with Python on Your Own Device
      1. Getting Started with the Command Line
      2. Installing Python, Jupyter Notebook, and a Code Editor
    6. Working with Python Online
    7. Hello World!
      1. Using Atom to Create a Standalone Python File
      2. Using Jupyter to Create a New Python Notebook
      3. Using Google Colab to Create a New Python Notebook
    8. Adding the Code
      1. In a Standalone File
      2. In a Notebook
    9. Running the Code
      1. In a Standalone File
      2. In a Notebook
    10. Documenting, Saving, and Versioning Your Work
      1. Documenting
      2. Saving
      3. Versioning
    11. Conclusion
  3. 2. Introduction to Python
    1. The Programming “Parts of Speech”
      1. Nouns ≈ Variables
      2. Verbs ≈ Functions
      3. Cooking with Custom Functions
      4. Libraries: Borrowing Custom Functions from Other Coders
    2. Taking Control: Loops and Conditionals
      1. In the Loop
      2. One Condition…
    3. Understanding Errors
      1. Syntax Snafus
      2. Runtime Runaround
      3. Logic Loss
    4. Hitting the Road with Citi Bike Data
      1. Starting with Pseudocode
      2. Seeking Scale
    5. Conclusion
  4. 3. Understanding Data Quality
    1. Assessing Data Fit
      1. Validity
      2. Reliability
      3. Representativeness
    2. Assessing Data Integrity
      1. Necessary, but Not Sufficient
      2. Important
      3. Achievable
    3. Improving Data Quality
      1. Data Cleaning
      2. Data Augmentation
    4. Conclusion
  5. 4. Working with File-Based and Feed-Based Data in Python
    1. Structured Versus Unstructured Data
    2. Working with Structured Data
      1. File-Based, Table-Type Data—Take It to Delimit
      2. Wrangling Table-Type Data with Python
    3. Real-World Data Wrangling: Understanding Unemployment
      1. XLSX, ODS, and All the Rest
      2. Finally, Fixed-Width
      3. Feed-Based Data—Web-Driven Live Updates
      4. Wrangling Feed-Type Data with Python
    4. Working with Unstructured Data
      1. Image-Based Text: Accessing Data in PDFs
      2. Wrangling PDFs with Python
      3. Accessing PDF Tables with Tabula
    5. Conclusion
  6. 5. Accessing Web-Based Data
    1. Accessing Online XML and JSON
    2. Introducing APIs
    3. Basic APIs: A Search Engine Example
    4. Specialized APIs: Adding Basic Authentication
      1. Getting a FRED API Key
      2. Using Your API Key to Request Data
    5. Reading API Documentation
    6. Protecting Your API Key When Using Python
      1. Creating Your “Credentials” File
      2. Using Your Credentials in a Separate Script
      3. Getting Started with .gitignore
    7. Specialized APIs: Working with OAuth
      1. Applying for a Twitter Developer Account
      2. Creating Your Twitter “App” and Credentials
      3. Encoding Your API Key and Secret
      4. Requesting an Access Token and Data from the Twitter API
    8. API Ethics
    9. Web Scraping: The Data Source of Last Resort
      1. Carefully Scraping the MTA
      2. Using Browser Inspection Tools
      3. The Python Web Scraping Solution: Beautiful Soup
    10. Conclusion
  7. 6. Assessing Data Quality
    1. The Pandemic and the PPP
    2. Assessing Data Integrity
      1. Is It of Known Pedigree?
      2. Is It Timely?
      3. Is It Complete?
      4. Is It Well-Annotated?
      5. Is It High Volume?
      6. Is It Consistent?
      7. Is It Multivariate?
      8. Is It Atomic?
      9. Is It Clear?
      10. Is It Dimensionally Structured?
    3. Assessing Data Fit
      1. Validity
      2. Reliability
      3. Representativeness
    4. Conclusion
  8. 7. Cleaning, Transforming, and Augmenting Data
    1. Selecting a Subset of Citi Bike Data
      1. A Simple Split
      2. Regular Expressions: Supercharged String Matching
      3. Making a Date
    2. De-crufting Data Files
    3. Decrypting Excel Dates
    4. Generating True CSVs from Fixed-Width Data
    5. Correcting for Spelling Inconsistencies
    6. The Circuitous Path to “Simple” Solutions
    7. Gotchas That Will Get Ya!
    8. Augmenting Your Data
    9. Conclusion
  9. 8. Structuring and Refactoring Your Code
    1. Revisiting Custom Functions
      1. Will You Use It More Than Once?
      2. Is It Ugly and Confusing?
      3. Do You Just Really Hate the Default Functionality?
    2. Understanding Scope
    3. Defining the Parameters for Function “Ingredients”
      1. What Are Your Options?
      2. Getting Into Arguments?
    4. Return Values
    5. Climbing the “Stack”
    6. Refactoring for Fun and Profit
      1. A Function for Identifying Weekdays
      2. Metadata Without the Mess
    7. Documenting Your Custom Scripts and Functions with pydoc
    8. The Case for Command-Line Arguments
    9. Where Scripts and Notebooks Diverge
    10. Conclusion
  10. 9. Introduction to Data Analysis
    1. Context Is Everything
    2. Same but Different
    3. What’s Typical? Evaluating Central Tendency
      1. What’s That Mean?
      2. Embrace the Median
    4. Think Different: Identifying Outliers
    5. Visualization for Data Analysis
      1. What’s Our Data’s Shape? Understanding Histograms
      2. The Significance of Symmetry
      3. Counting “Clusters”
    6. The $2 Million Question
    7. Proportional Response
    8. Conclusion
  11. 10. Presenting Your Data
    1. Foundations for Visual Eloquence
    2. Making Your Data Statement
    3. Charts, Graphs, and Maps: Oh My!
      1. Pie Charts
      2. Bar and Column Charts
      3. Line Charts
      4. Scatter Charts
      5. Maps
    4. Elements of Eloquent Visuals
      1. The “Finicky” Details Really Do Make a Difference
      2. Trust Your Eyes (and the Experts)
      3. Selecting Scales
      4. Choosing Colors
      5. Above All, Annotate!
    5. From Basic to Beautiful: Customizing a Visualization with seaborn and matplotlib
    6. Beyond the Basics
    7. Conclusion
  12. 11. Beyond Python
    1. Additional Tools for Data Review
      1. Spreadsheet Programs
      2. OpenRefine
    2. Additional Tools for Sharing and Presenting Data
      1. Image Editing for JPGs, PNGs, and GIFs
      2. Software for Editing SVGs and Other Vector Formats
    3. Reflecting on Ethics
    4. Conclusion
  13. A. More Python Programming Resources
    1. Official Python Documentation
    2. Installing Python Resources
      1. Where to Look for Libraries
    3. Keeping Your Tools Sharp
    4. Where to Learn More
  14. B. A Bit More About Git
    1. You Run git push/pull and End Up in a Weird Text Editor
    2. Your git push/pull Command Gets Rejected
      1. Run git pull
    3. Git Quick Reference
  15. C. Finding Data
    1. Data Repositories and APIs
    2. Subject Matter Experts
    3. FOIA/L Requests
    4. Custom Data Collection
  16. D. Resources for Visualization and Information Design
    1. Foundational Books on Information Visualization
    2. The Quick Reference You’ll Reach For
    3. Sources of Inspiration
  17. Index
  18. About the Author

Product information

  • Title: Practical Python Data Wrangling and Data Quality
  • Author(s): Susan E. McGregor
  • Release date: December 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492091509