Data Wrangling with Python

Book Description

Simplify your ETL processes with these hands-on data hygiene tips, tricks, and best practices.

Key Features

  • Focus on the basics of data wrangling
  • Study various ways to extract the most out of your data in less time
  • Speed up your learning with bonus topics such as random data generation and data integrity checks

Book Description

For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you the core ideas behind these processes and equips you with knowledge of the most popular tools and techniques in the domain.

The book starts with the absolute basics of Python, focusing mainly on data structures. It then delves into the fundamental tools of data wrangling, such as the NumPy and Pandas libraries. You'll learn why you should stay away from the traditional ways of cleaning data used in other languages and instead take advantage of the specialized pre-built routines in Python. This combination of Python tips and tricks will also demonstrate how to use the same Python backend to extract and transform data from an array of sources, including the Internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, you'll cover how to handle missing or incorrect data and reformat it to meet the requirements of the downstream analytics tool. The book will further help you grasp concepts through real-world examples and datasets.

By the end of this book, you will be confident in using a diverse array of sources to extract, clean, transform, and format your data efficiently.

What you will learn

  • Use and manipulate complex and simple data structures
  • Harness the full potential of DataFrames and numpy.array at run time
  • Perform web scraping with BeautifulSoup4 and html5lib
  • Execute advanced string search and manipulation with RegEx
  • Handle outliers and perform data imputation with Pandas
  • Use descriptive statistics and plotting techniques
  • Practice data wrangling and modeling using data generation techniques

Who this book is for

Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have a rudimentary knowledge of relational databases and SQL.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Preface
    1. About the Book
      1. About the Authors
      2. Learning Objectives
      3. Approach
      4. Audience
      5. Minimum Hardware Requirements
      6. Software Requirements
      7. Conventions
      8. Installation and Setup
      9. Installing the Code Bundle
      10. Additional Resources
  2. Chapter 1
  3. Introduction to Data Wrangling with Python
    1. Introduction
      1. Importance of Data Wrangling
    2. Python for Data Wrangling
    3. Lists, Sets, Strings, Tuples, and Dictionaries
      1. Lists
      2. Exercise 1: Accessing the List Members
      3. Exercise 2: Generating a List
      4. Exercise 3: Iterating over a List and Checking Membership
      5. Exercise 4: Sorting a List
      6. Exercise 5: Generating a Random List
      7. Activity 1: Handling Lists
      8. Sets
      9. Introduction to Sets
      10. Union and Intersection of Sets
      11. Creating Null Sets
      12. Dictionary
      13. Exercise 6: Accessing and Setting Values in a Dictionary
      14. Exercise 7: Iterating Over a Dictionary
      15. Exercise 8: Revisiting the Unique Valued List Problem
      16. Exercise 9: Deleting Value from Dict
      17. Exercise 10: Dictionary Comprehension
      18. Tuples
      19. Creating a Tuple with Different Cardinalities
      20. Unpacking a Tuple
      21. Exercise 11: Handling Tuples
      22. Strings
      23. Exercise 12: Accessing Strings
      24. Exercise 13: String Slices
      25. String Functions
      26. Exercise 14: Split and Join
      27. Activity 2: Analyze a Multiline String and Generate the Unique Word Count
    4. Summary
  4. Chapter 2
  5. Advanced Data Structures and File Handling
    1. Introduction
    2. Advanced Data Structures
      1. Iterator
      2. Exercise 15: Introduction to the Iterator
      3. Stacks
      4. Exercise 16: Implementing a Stack in Python
      5. Exercise 17: Implementing a Stack Using User-Defined Methods
      6. Exercise 18: Lambda Expression
      7. Exercise 19: Lambda Expression for Sorting
      8. Exercise 20: Multi-Element Membership Checking
      9. Queue
      10. Exercise 21: Implementing a Queue in Python
      11. Activity 3: Permutation, Iterator, Lambda, List
    3. Basic File Operations in Python
      1. Exercise 22: File Operations
      2. File Handling
      3. Exercise 23: Opening and Closing a File
      4. The with Statement
      5. Opening a File Using the with Statement
      6. Exercise 24: Reading a File Line by Line
      7. Exercise 25: Write to a File
      8. Activity 4: Design Your Own CSV Parser
    4. Summary
  6. Chapter 3
  7. Introduction to NumPy, Pandas, and Matplotlib
    1. Introduction
    2. NumPy Arrays
      1. NumPy Array and Features
      2. Exercise 26: Creating a NumPy Array (from a List)
      3. Exercise 27: Adding Two NumPy Arrays
      4. Exercise 28: Mathematical Operations on NumPy Arrays
      5. Exercise 29: Advanced Mathematical Operations on NumPy Arrays
      6. Exercise 30: Generating Arrays Using arange and linspace
      7. Exercise 31: Creating Multi-Dimensional Arrays
      8. Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array
      9. Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors
      10. Exercise 34: Reshaping, Ravel, Min, Max, and Sorting
      11. Exercise 35: Indexing and Slicing
      12. Conditional Subsetting
      13. Exercise 36: Array Operations (array-array, array-scalar, and universal functions)
      14. Stacking Arrays
    3. Pandas DataFrames
      1. Exercise 37: Creating a Pandas Series
      2. Exercise 38: Pandas Series and Data Handling
      3. Exercise 39: Creating Pandas DataFrames
      4. Exercise 40: Viewing a DataFrame Partially
      5. Indexing and Slicing Columns
      6. Indexing and Slicing Rows
      7. Exercise 41: Creating and Deleting a New Column or Row
    4. Statistics and Visualization with NumPy and Pandas
      1. Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization)
      2. Exercise 42: Introduction to Matplotlib Through a Scatter Plot
      3. Definition of Statistical Measures – Central Tendency and Spread
      4. Random Variables and Probability Distribution
      5. What Is a Probability Distribution?
      6. Discrete Distributions
      7. Continuous Distributions
      8. Data Wrangling in Statistics and Visualization
      9. Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame
      10. Random Number Generation Using NumPy
      11. Exercise 43: Generating Random Numbers from a Uniform Distribution
      12. Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot
      13. Exercise 45: Generating Random Numbers from Normal Distribution and Histograms
      14. Exercise 46: Calculation of Descriptive Statistics from a DataFrame
      15. Exercise 47: Built-in Plotting Utilities
      16. Activity 5: Generating Statistics from a CSV File
    5. Summary
  8. Chapter 4
  9. A Deep Dive into Data Wrangling with Python
    1. Introduction
    2. Subsetting, Filtering, and Grouping
      1. Exercise 48: Loading and Examining a Superstore's Sales Data from an Excel File
      2. Subsetting the DataFrame
      3. An Example Use Case: Determining Statistics on Sales and Profit
      4. Exercise 49: The unique Function
      5. Conditional Selection and Boolean Filtering
      6. Exercise 50: Setting and Resetting the Index
      7. Exercise 51: The GroupBy Method
    3. Detecting Outliers and Handling Missing Values
      1. Missing Values in Pandas
      2. Exercise 52: Filling in the Missing Values with fillna
      3. Exercise 53: Dropping Missing Values with dropna
      4. Outlier Detection Using a Simple Statistical Test
    4. Concatenating, Merging, and Joining
      1. Exercise 54: Concatenation
      2. Exercise 55: Merging by a Common Key
      3. Exercise 56: The join Method
    5. Useful Methods of Pandas
      1. Exercise 57: Randomized Sampling
      2. The value_counts Method
      3. Pivot Table Functionality
      4. Exercise 58: Sorting by Column Values – the sort_values Method
      5. Exercise 59: Flexibility for User-Defined Functions with the apply Method
      6. Activity 6: Working with the Adult Income Dataset (UCI)
    6. Summary
  10. Chapter 5
  11. Getting Comfortable with Different Kinds of Data Sources
    1. Introduction
    2. Reading Data from Different Text-Based (and Non-Text-Based) Sources
      1. Data Files Provided with This Chapter
      2. Libraries to Install for This Chapter
      3. Exercise 60: Reading Data from a CSV File Where Headers Are Missing
      4. Exercise 61: Reading from a CSV File where Delimiters are not Commas
      5. Exercise 62: Bypassing the Headers of a CSV File
      6. Exercise 63: Skipping Initial Rows and Footers when Reading a CSV File
      7. Reading Only the First N Rows (Especially Useful for Large Files)
      8. Exercise 64: Combining Skiprows and Nrows to Read Data in Small Chunks
      9. Setting the skip_blank_lines Option
      10. Read CSV from a Zip file
      11. Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name
      12. Exercise 65: Reading a General Delimited Text File
      13. Reading HTML Tables Directly from a URL
      14. Exercise 66: Further Wrangling to Get the Desired Data
      15. Exercise 67: Reading from a JSON File
      16. Reading a Stata File
      17. Exercise 68: Reading Tabular Data from a PDF File
    3. Introduction to Beautiful Soup 4 and Web Page Parsing
      1. Structure of HTML
      2. Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup
      3. Exercise 70: DataFrames and BeautifulSoup
      4. Exercise 71: Exporting a DataFrame as an Excel File
      5. Exercise 72: Stacking URLs from a Document using bs4
      6. Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
    4. Summary
  12. Chapter 6
  13. Learning the Hidden Secrets of Data Wrangling
    1. Introduction
      1. Additional Software Required for This Section
    2. Advanced List Comprehension and the zip Function
      1. Introduction to Generator Expressions
      2. Exercise 73: Generator Expressions
      3. Exercise 74: One-Liner Generator Expression
      4. Exercise 75: Extracting a List with Single Words
      5. Exercise 76: The zip Function
      6. Exercise 77: Handling Messy Data
    3. Data Formatting
      1. The % operator
      2. Using the format Function
      3. Exercise 78: Data Representation Using {}
    4. Identify and Clean Outliers
      1. Exercise 79: Outliers in Numerical Data
      2. Z-score
      3. Exercise 80: The Z-Score Value to Remove Outliers
      4. Exercise 81: Fuzzy Matching of Strings
    5. Activity 8: Handling Outliers and Missing Data
    6. Summary
  14. Chapter 7
  15. Advanced Web Scraping and Data Gathering
    1. Introduction
    2. The Basics of Web Scraping and the Beautiful Soup Library
      1. Libraries in Python
      2. Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page
      3. Exercise 82: Checking the Status of the Web Request
      4. Checking the Encoding of the Web Page
      5. Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length
      6. Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object
      7. Extracting Text from a Section
      8. Extracting Important Historical Events that Happened on Today's Date
      9. Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text
      10. Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page
    3. Reading Data from XML
      1. Exercise 87: Creating an XML File and Reading XML Element Objects
      2. Exercise 88: Finding Various Elements of Data within a Tree (Element)
      3. Reading from a Local XML File into an ElementTree Object
      4. Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes
      5. Exercise 90: Using the text Method to Extract Meaningful Data
      6. Extracting and Printing the GDP/Per Capita Information Using a Loop
      7. Exercise 91: Finding All the Neighboring Countries for each Country and Printing Them
      8. Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping
    4. Reading Data from an API
      1. Defining the Base URL (or API Endpoint)
      2. Exercise 93: Defining and Testing a Function to Pull Country Data from an API
      3. Using the Built-In JSON Library to Read and Examine Data
      4. Printing All the Data Elements
      5. Using a Function that Extracts a DataFrame Containing Key Information
      6. Exercise 94: Testing the Function by Building a Small Database of Countries' Information
    5. Fundamentals of Regular Expressions (RegEx)
      1. Regex in the Context of Web Scraping
      2. Exercise 95: Using the match Method to Check Whether a Pattern matches a String/Sequence
      3. Using the Compile Method to Create a Regex Program
      4. Exercise 96: Compiling Programs to Match Objects
      5. Exercise 97: Using Additional Parameters in Match to Check for Positional Matching
      6. Finding the Number of Words in a List That End with "ing"
      7. Exercise 98: The search Method in Regex
      8. Exercise 99: Using the span Method of the Match Object to Locate the Position of the Matched Pattern
      9. Exercise 100: Examples of Single Character Pattern Matching with search
      10. Exercise 101: Examples of Pattern Matching at the Start or End of a String
      11. Exercise 102: Examples of Pattern Matching with Multiple Characters
      12. Exercise 103: Greedy versus Non-Greedy Matching
      13. Exercise 104: Controlling Repetitions to Match
      14. Exercise 105: Sets of Matching Characters
      15. Exercise 106: The use of OR in Regex using the OR Operator
      16. The findall Method
      17. Activity 9: Extracting the Top 100 eBooks from Gutenberg
      18. Activity 10: Building Your Own Movie Database by Reading an API
    6. Summary
  16. Chapter 8
  17. RDBMS and SQL
    1. Introduction
    2. Refresher of RDBMS and SQL
      1. How is an RDBMS Structured?
      2. SQL
    3. Using an RDBMS (MySQL/PostgreSQL/SQLite)
      1. Exercise 107: Connecting to Database in SQLite
      2. Exercise 108: DDL and DML Commands in SQLite
      3. Reading Data from a Database in SQLite
      4. Exercise 109: Sorting Values that are Present in the Database
      5. Exercise 110: Altering the Structure of a Table and Updating the New Fields
      6. Exercise 111: Grouping Values in Tables
      7. Relation Mapping in Databases
      8. Adding Rows in the comments Table
      9. Joins
      10. Retrieving Specific Columns from a JOIN query
      11. Exercise 112: Deleting Rows
      12. Updating Specific Values in a Table
      13. Exercise 113: RDBMS and DataFrames
      14. Activity 11: Retrieving Data Correctly From Databases
    4. Summary
  18. Chapter 9
  19. Application of Data Wrangling in Real Life
    1. Introduction
    2. Applying Your Knowledge to a Real-life Data Wrangling Task
    3. Activity 12: Data Wrangling Task – Fixing UN Data
    4. Activity 13: Data Wrangling Task – Cleaning GDP Data
    5. Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
    6. Activity 15: Data Wrangling Task – Connecting the New Data to the Database
    7. An Extension to Data Wrangling
      1. Additional Skills Required to Become a Data Scientist
      2. Basic Familiarity with Big Data and Cloud Technologies
      3. What Goes with Data Wrangling?
      4. Tips and Tricks for Mastering Machine Learning
    8. Summary
  20. Appendix
    1. Solution of Activity 1: Handling Lists
      1. Solution of Activity 2: Analyze a Multiline String and Generate the Unique Word Count
      2. Solution of Activity 3: Permutation, Iterator, Lambda, List
      3. Solution of Activity 4: Design Your Own CSV Parser
      4. Solution of Activity 5: Generating Statistics from a CSV File
      5. Solution of Activity 6: Working with the Adult Income Dataset (UCI)
      6. Solution of Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
      7. Solution of Activity 8: Handling Outliers and Missing Data
      8. Solution of Activity 9: Extracting the Top 100 eBooks from Gutenberg
      9. Solution of Activity 10: Extracting the top 100 eBooks from Gutenberg.org
      10. Solution of Activity 11: Retrieving Data Correctly from Databases
      11. Solution of Activity 12: Data Wrangling Task – Fixing UN Data
      12. Activity 13: Data Wrangling Task – Cleaning GDP Data
      13. Solution of Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
      14. Activity 15: Data Wrangling Task – Connecting the New Data to a Database