Book description
A beginner's guide to simplifying Extract, Transform, Load (ETL) processes with the help of hands-on tips, tricks, and best practices, in a fun and interactive way
Key Features
- Explore data wrangling with the help of real-world examples and business use cases
- Study various ways to extract the most value from your data in minimal time
- Boost your knowledge with bonus topics, such as random data generation and data integrity checks
Book Description
While a huge amount of data is readily available to us, it is not useful in its raw form. For data to be meaningful, it must be curated and refined.
If you're a beginner, The Data Wrangling Workshop will help break down the process for you. You'll start with the basics and build your knowledge, progressing from the core concepts behind data wrangling to using the most popular tools and techniques.
This book starts by showing you how to work with data structures using Python. Through examples and activities, you'll understand why you should stay away from traditional methods of data cleaning used in other languages and take advantage of the specialized pre-built routines in Python. Later, you'll learn how to use the same Python backend to extract and transform data from an array of sources, including the internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, the book teaches you how to handle missing or incorrect data and reformat it to meet the requirements of your downstream analytics tool.
By the end of this book, you will have developed a solid understanding of how to perform data wrangling with Python, and learned several techniques and best practices to extract, clean, transform, and format your data efficiently, from a diverse array of sources.
What you will learn
- Get to grips with the fundamentals of data wrangling
- Understand how to model data with random data generation and data integrity checks
- Discover how to examine data with descriptive statistics and plotting techniques
- Explore how to search and retrieve information with regular expressions
- Delve into commonly used Python data science libraries
- Become well versed in handling and compensating for missing data
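As a rough illustration of two of the topics listed above (retrieving information with regular expressions and compensating for missing data), here is a minimal sketch that uses only the Python standard library. It is not taken from the book: the record format, the `parse_record` helper, and the choice of mean imputation are all assumptions made for this example.

```python
import re
from statistics import mean

# Hypothetical raw records; the format and values are invented for illustration.
raw_records = [
    "order=1001; amount=25.50",
    "order=1002; amount=",
    "order=1003; amount=14.00",
]

# Named groups pull out the order ID and the amount; the amount may be empty.
pattern = re.compile(r"order=(?P<order>\d+); amount=(?P<amount>[\d.]*)")

def parse_record(line):
    """Return (order_id, amount), with amount set to None when it is missing."""
    match = pattern.search(line)
    if match is None:
        raise ValueError(f"unrecognized record: {line!r}")
    amount_text = match.group("amount")
    amount = float(amount_text) if amount_text else None
    return int(match.group("order")), amount

parsed = [parse_record(line) for line in raw_records]

# Compensate for missing amounts by imputing the mean of the observed values.
observed = [amount for _, amount in parsed if amount is not None]
fill_value = mean(observed)
cleaned = [
    (order, amount if amount is not None else fill_value)
    for order, amount in parsed
]

print(cleaned)
```

Mean imputation is only one possible strategy; dropping incomplete records or using a median may suit other datasets better.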
Who this book is for
The Data Wrangling Workshop is designed for developers, data analysts, and business analysts who are looking to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners who want to start data wrangling, prior working knowledge of the Python programming language is necessary to easily grasp the concepts covered here. It will also help to have a rudimentary knowledge of relational databases and SQL.
Table of contents
- The Data Wrangling Workshop
- Second Edition
- Preface
- 1. Introduction to Data Wrangling with Python
- Introduction
- Importance of Data Wrangling
- Python for Data Wrangling
- Lists, Sets, Strings, Tuples, and Dictionaries
- List Functions
- Exercise 1.01: Accessing the List Members
- Exercise 1.02: Generating and Iterating through a List
- Exercise 1.03: Iterating over a List and Checking Membership
- Exercise 1.04: Sorting a List
- Exercise 1.05: Generating a Random List
- Activity 1.01: Handling Lists
- Sets
- Introduction to Sets
- Union and Intersection of Sets
- Creating Null Sets
- Dictionary
- Exercise 1.06: Accessing and Setting Values in a Dictionary
- Exercise 1.07: Iterating over a Dictionary
- Exercise 1.08: Revisiting the Unique Valued List Problem
- Exercise 1.09: Deleting a Value from Dict
- Exercise 1.10: Dictionary Comprehension
- Tuples
- Creating a Tuple with Different Cardinalities
- Unpacking a Tuple
- Exercise 1.11: Handling Tuples
- Strings
- Exercise 1.12: Accessing Strings
- Exercise 1.13: String Slices
- String Functions
- Exercise 1.14: Splitting and Joining a String
- Activity 1.02: Analyzing a Multiline String and Generating the Unique Word Count
- Summary
- 2. Advanced Operations on Built-In Data Structures
- Introduction
- Advanced Data Structures
- Iterator
- Exercise 2.01: Introducing to the Iterator
- Stacks
- Exercise 2.02: Implementing a Stack in Python
- Exercise 2.03: Implementing a Stack Using User-Defined Methods
- Lambda Expressions
- Exercise 2.04: Implementing a Lambda Expression
- Exercise 2.05: Lambda Expression for Sorting
- Exercise 2.06: Multi-Element Membership Checking
- Queue
- Exercise 2.07: Implementing a Queue in Python
- Activity 2.01: Permutation, Iterator, Lambda, and List
- Basic File Operations in Python
- Summary
- 3. Introduction to NumPy, Pandas, and Matplotlib
- Introduction
- NumPy Arrays
- Advanced Mathematical Operations
- Exercise 3.04: Advanced Mathematical Operations on NumPy Arrays
- Exercise 3.05: Generating Arrays Using arange and linspace Methods
- Exercise 3.06: Creating Multi-Dimensional Arrays
- Exercise 3.07: The Dimension, Shape, Size, and Data Type of Two-dimensional Arrays
- Exercise 3.08: Zeros, Ones, Random, Identity Matrices, and Vectors
- Exercise 3.09: Reshaping, Ravel, Min, Max, and Sorting
- Exercise 3.10: Indexing and Slicing
- Conditional SubSetting
- Exercise 3.11: Array Operations
- Stacking Arrays
- Pandas DataFrames
- Exercise 3.12: Creating a Pandas Series
- Exercise 3.13: Pandas Series and Data Handling
- Exercise 3.14: Creating Pandas DataFrames
- Exercise 3.15: Viewing a DataFrame Partially
- Indexing and Slicing Columns
- Indexing and Slicing Rows
- Exercise 3.16: Creating and Deleting a New Column or Row
- Statistics and Visualization with NumPy and Pandas
- The Definition of Statistical Measures – Central Tendency and Spread
- Data Wrangling in Statistics and Visualization
- Using NumPy and Pandas to Calculate Basic Descriptive Statistics
- Random Number Generation Using NumPy
- Exercise 3.18: Generating Random Numbers from a Uniform Distribution
- Exercise 3.19: Generating Random Numbers from a Binomial Distribution and Bar Plot
- Exercise 3.20: Generating Random Numbers from a Normal Distribution and Histograms
- Exercise 3.21: Calculating Descriptive Statistics from a DataFrame
- Exercise 3.22: Built-in Plotting Utilities
- Activity 3.01: Generating Statistics from a CSV File
- Summary
- 4. A Deep Dive into Data Wrangling with Python
- Introduction
- Subsetting, Filtering, and Grouping
- Exercise 4.01: Examining the Superstore Sales Data in an Excel File
- Subsetting the DataFrame
- An Example Use Case – Determining Statistics on Sales and Profit
- Exercise 4.02: The unique Function
- Conditional Selection and Boolean Filtering
- Exercise 4.03: Setting and Resetting the Index
- The GroupBy Method
- Exercise 4.04: The GroupBy Method
- Detecting Outliers and Handling Missing Values
- Concatenating, Merging, and Joining
- Useful Methods of Pandas
- Randomized Sampling
- Exercise 4.10: Randomized Sampling
- The value_counts Method
- Pivot Table Functionality
- Exercise 4.11: Sorting by Column Values – the sort_values Method
- Exercise 4.12: Flexibility of User-Defined Functions with the apply Method
- Activity 4.01: Working with the Adult Income Dataset (UCI)
- Summary
- 5. Getting Comfortable with Different Kinds of Data Sources
- Introduction
- Reading Data from Different Sources
- Data Files Provided with This Chapter
- Libraries to Install for This Chapter
- Reading Data Using Pandas
- Exercise 5.01: Working with Headers When Reading Data from a CSV File
- Exercise 5.02: Reading from a CSV File Where Delimiters Are Not Commas
- Exercise 5.03: Bypassing and Renaming the Headers of a CSV File
- Exercise 5.04: Skipping Initial Rows and Footers When Reading a CSV File
- Reading Only the First N Rows
- Exercise 5.05: Combining skiprows and nrows to Read Data in Small Chunks
- Setting the skip_blank_lines Option
- Reading CSV Data from a Zip File
- Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name
- Exercise 5.06: Reading a General Delimited Text File
- Reading HTML Tables Directly from a URL
- Exercise 5.07: Further Wrangling to Get the Desired Data
- Reading from a JSON file
- Exercise 5.08: Reading from a JSON File
- Reading a PDF File
- Exercise 5.09: Reading Tabular Data from a PDF File
- Introduction to Beautiful Soup 4 and Web Page Parsing
- Structure of HTML
- Exercise 5.10: Reading an HTML File and Extracting Its Contents Using Beautiful Soup
- Exercise 5.11: DataFrames and BeautifulSoup
- Exercise 5.12: Exporting a DataFrame as an Excel File
- Exercise 5.13: Stacking URLs from a Document Using bs4
- Activity 5.01: Reading Tabular Data from a Web Page and Creating DataFrames
- Summary
- 6. Learning the Hidden Secrets of Data Wrangling
- 7. Advanced Web Scraping and Data Gathering
- Introduction
- The Requests and BeautifulSoup Libraries
- Exercise 7.01: Using the Requests Library to Get a Response from the Wikipedia Home Page
- Exercise 7.02: Checking the Status of the Web Request
- Checking the Encoding of a Web Page
- Exercise 7.03: Decoding the Contents of a Response and Checking Its Length
- Exercise 7.04: Extracting Readable Text from a BeautifulSoup Object
- Extracting Text from a Section
- Extracting Important Historical Events that Happened on Today's Date
- Exercise 7.05: Using Advanced BS4 Techniques to Extract Relevant Text
- Exercise 7.06: Creating a Compact Function to Extract the On this day Text from the Wikipedia Home Page
- Reading Data from XML
- Exercise 7.07: Creating an XML File and Reading XML Element Objects
- Exercise 7.08: Finding Various Elements of Data within a Tree (Element)
- Reading from a Local XML File into an ElementTree Object
- Exercise 7.09: Traversing the Tree, Finding the Root, and Exploring All the Child Nodes and Their Tags and Attributes
- Exercise 7.10: Using the text Method to Extract Meaningful Data
- Extracting and Printing the GDP/Per Capita Information Using a Loop
- Finding All the Neighboring Countries for Each Country and Printing Them
- Exercise 7.11: A Simple Demo of Using XML Data Obtained by Web Scraping
- Reading Data from an API
- Defining the Base URL (or API Endpoint)
- Exercise 7.12: Defining and Testing a Function to Pull Country Data from an API
- Using the Built-In JSON Library to Read and Examine Data
- Printing All the Data Elements
- Using a Function that Extracts a DataFrame Containing Key Information
- Exercise 7.13: Testing the Function by Building a Small Database of Country Information
- Fundamentals of Regular Expressions (RegEx)
- RegEx in the Context of Web Scraping
- Exercise 7.14: Using the match Method to Check Whether a Pattern Matches a String/Sequence
- Using the compile Method to Create a RegEx Program
- Exercise 7.15: Compiling Programs to Match Objects
- Exercise 7.16: Using Additional Parameters in the match Method to Check for Positional Matching
- Finding the Number of Words in a List That End with "ing"
- The search Method in RegEx
- Exercise 7.17: The search Method in RegEx
- Exercise 7.18: Using the span Method of the Match Object to Locate the Position of the Matched Pattern
- Exercise 7.19: Examples of Single-Character Pattern Matching with search
- Exercise 7.20: Handling Pattern Matching at the Start or End of a String
- Exercise 7.21: Pattern Matching with Multiple Characters
- Exercise 7.22: Greedy versus Non-Greedy Matching
- Exercise 7.23: Controlling Repetitions to Match in a Text
- Sets of Matching Characters
- Exercise 7.24: Sets of Matching Characters
- Exercise 7.25: The Use of OR in RegEx Using the OR Operator
- The findall Method
- Activity 7.01: Extracting the Top 100 e-books from Gutenberg
- Activity 7.02: Building Your Own Movie Database by Reading an API
- Summary
- 8. RDBMS and SQL
- Introduction
- Refresher of RDBMS and SQL
- How Is an RDBMS Structured?
- SQL
- Using an RDBMS (MySQL/PostgreSQL/SQLite)
- Exercise 8.01: Connecting to a Database in SQLite
- DDL and DML Commands in SQLite
- Exercise 8.02: Using DDL and DML Commands in SQLite
- Reading Data from a Database in SQLite
- Exercise 8.03: Sorting Values That Are Present in the Database
- The ALTER Command
- Exercise 8.04: Altering the Structure of a Table and Updating the New Fields
- The GROUP BY clause
- Exercise 8.05: Grouping Values in Tables
- Relation Mapping in Databases
- Joins
- Retrieving Specific Columns from a JOIN Query
- Summary
- 9. Applications in Business Use Cases and Conclusion of the Course
- Appendix
- 1. Introduction to Data Wrangling with Python
- 2. Advanced Operations on Built-In Data Structures
- 3. Introduction to NumPy, Pandas, and Matplotlib
- 4. A Deep Dive into Data Wrangling with Python
- 5. Getting Comfortable with Different Kinds of Data Sources
- 6. Learning the Hidden Secrets of Data Wrangling
- 7. Advanced Web Scraping and Data Gathering
- 8. RDBMS and SQL
- 9. Applications in Business Use Cases and Conclusion of the Course
Product information
- Title: The Data Wrangling Workshop - Second Edition
- Author(s):
- Release date: July 2020
- Publisher(s): Packt Publishing
- ISBN: 9781839215001