Learning Data Science

Book description

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.

Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas.

  • Refine a question of interest to one that can be studied with data
  • Pursue data collection that may involve text processing, web scraping, etc.
  • Glean valuable insights about data through data cleaning, exploration, and visualization
  • Learn how to use modeling to describe the data
  • Generalize findings beyond the data

Table of contents

  1. Preface
    1. Expected Background Knowledge
    2. Organization of the Book
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
  2. I. The Data Science Lifecycle
  3. 1. The Data Science Lifecycle
    1. The Stages of the Lifecycle
    2. Examples of the Lifecycle
    3. Summary
  4. 2. Questions and Data Scope
    1. Big Data and New Opportunities
      1. Example: Google Flu Trends
    2. Target Population, Access Frame, and Sample
      1. Example: What Makes Members of an Online Community Active?
      2. Example: Who Will Win the Election?
      3. Example: How Do Environmental Hazards Relate to an Individual’s Health?
    3. Instruments and Protocols
    4. Measuring Natural Phenomena
      1. Example: What Is the Level of CO2 in the Air?
    5. Accuracy
      1. Types of Bias
      2. Types of Variation
    6. Summary
  5. 3. Simulation and Data Design
    1. The Urn Model
      1. Sampling Designs
      2. Sampling Distribution of a Statistic
      3. Simulating the Sampling Distribution
      4. Simulation with the Hypergeometric Distribution
    2. Example: Simulating Election Poll Bias and Variance
      1. The Pennsylvania Urn Model
      2. An Urn Model with Bias
      3. Conducting Larger Polls
    3. Example: Simulating a Randomized Trial for a Vaccine
      1. Scope
      2. The Urn Model for Random Assignment
    4. Example: Measuring Air Quality
    5. Summary
  6. 4. Modeling with Summary Statistics
    1. The Constant Model
    2. Minimizing Loss
      1. Mean Absolute Error
      2. Mean Squared Error
      3. Choosing Loss Functions
    3. Summary
  7. 5. Case Study: Why Is My Bus Always Late?
    1. Question and Scope
    2. Data Wrangling
    3. Exploring Bus Times
    4. Modeling Wait Times
    5. Summary
  8. II. Rectangular Data
  9. 6. Working with Dataframes Using pandas
    1. Subsetting
      1. Data Scope and Question
      2. Dataframes and Indices
      3. Slicing
      4. Filtering Rows
      5. Example: How Recently Has Luna Become a Popular Name?
    2. Aggregating
      1. Basic Group-Aggregate
      2. Grouping on Multiple Columns
      3. Custom Aggregation Functions
      4. Pivoting
    3. Joining
      1. Inner Joins
      2. Left, Right, and Outer Joins
      3. Example: Popularity of NYT Name Categories
    4. Transforming
      1. Apply
      2. Example: Popularity of “L” Names
      3. The Price of Apply
    5. How Are Dataframes Different from Other Data Representations?
      1. Dataframes and Spreadsheets
      2. Dataframes and Matrices
      3. Dataframes and Relations
    6. Summary
  10. 7. Working with Relations Using SQL
    1. Subsetting
      1. SQL Basics: SELECT and FROM
      2. What’s a Relation?
      3. Slicing
      4. Filtering Rows
      5. Example: How Recently Has Luna Become a Popular Name?
    2. Aggregating
      1. Basic Group-Aggregate Using GROUP BY
      2. Grouping on Multiple Columns
      3. Other Aggregation Functions
    3. Joining
      1. Inner Joins
      2. Left and Right Joins
      3. Example: Popularity of NYT Name Categories
    4. Transforming and Common Table Expressions
      1. SQL Functions
      2. Multistep Queries Using a WITH Clause
      3. Example: Popularity of “L” Names
    5. Summary
  11. III. Understanding the Data
  12. 8. Wrangling Files
    1. Data Source Examples
      1. Drug Abuse Warning Network (DAWN) Survey
      2. San Francisco Restaurant Food Safety
    2. File Formats
      1. Delimited Format
      2. Fixed-Width Format
      3. Hierarchical Formats
      4. Loosely Formatted Text
    3. File Encoding
    4. File Size
    5. The Shell and Command-Line Tools
    6. Table Shape and Granularity
      1. Granularity of Restaurant Inspections and Violations
      2. DAWN Survey Shape and Granularity
    7. Summary
  13. 9. Wrangling Dataframes
    1. Example: Wrangling CO2 Measurements from the Mauna Loa Observatory
      1. Quality Checks
      2. Addressing Missing Data
      3. Reshaping the Data Table
    2. Quality Checks
      1. Quality Based on Scope
      2. Quality of Measurements and Recorded Values
      3. Quality Across Related Features
      4. Quality for Analysis
      5. Fixing the Data or Not
    3. Missing Values and Records
    4. Transformations and Timestamps
      1. Transforming Timestamps
      2. Piping for Transformations
    5. Modifying Structure
    6. Example: Wrangling Restaurant Safety Violations
      1. Narrowing the Focus
      2. Aggregating Violations
      3. Extracting Information from Violation Descriptions
    7. Summary
  14. 10. Exploratory Data Analysis
    1. Feature Types
      1. Example: Dog Breeds
      2. Transforming Qualitative Features
      3. The Importance of Feature Types
    2. What to Look For in a Distribution
    3. What to Look For in a Relationship
      1. Two Quantitative Features
      2. One Qualitative and One Quantitative Variable
      3. Two Qualitative Features
    4. Comparisons in Multivariate Settings
    5. Guidelines for Exploration
    6. Example: Sale Prices for Houses
      1. Understanding Price
      2. What Next?
      3. Examining Other Features
      4. Delving Deeper into Relationships
      5. Fixing Location
      6. EDA Discoveries
    7. Summary
  15. 11. Data Visualization
    1. Choosing Scale to Reveal Structure
      1. Filling the Data Region
      2. Including Zero
      3. Revealing Shape Through Transformations
      4. Banking to Decipher Relationships
      5. Revealing Relationships Through Straightening
    2. Smoothing and Aggregating Data
      1. Smoothing Techniques to Uncover Shape
      2. Smoothing Techniques to Uncover Relationships and Trends
      3. Smoothing Techniques Need Tuning
      4. Reducing Distributions to Quantiles
      5. When Not to Smooth
    3. Facilitating Meaningful Comparisons
      1. Emphasize the Important Difference
      2. Ordering Groups
      3. Avoid Stacking
      4. Selecting a Color Palette
      5. Guidelines for Comparisons in Plots
    4. Incorporating the Data Design
      1. Data Collected Over Time
      2. Observational Studies
      3. Unequal Sampling
      4. Geographic Data
    5. Adding Context
      1. Example: 100m Sprint Times
    6. Creating Plots Using plotly
      1. Figure and Trace Objects
      2. Modifying Layout
      3. Plotting Functions
      4. Annotations
    7. Other Tools for Visualization
      1. matplotlib
      2. Grammar of Graphics
    8. Summary
  16. 12. Case Study: How Accurate Are Air Quality Measurements?
    1. Question, Design, and Scope
    2. Finding Collocated Sensors
      1. Wrangling the List of AQS Sites
      2. Wrangling the List of PurpleAir Sites
      3. Matching AQS and PurpleAir Sensors
    3. Wrangling and Cleaning AQS Sensor Data
      1. Checking Granularity
      2. Removing Unneeded Columns
      3. Checking the Validity of Dates
      4. Checking the Quality of PM2.5 Measurements
    4. Wrangling PurpleAir Sensor Data
      1. Checking the Granularity
      2. Handling Missing Values
    5. Exploring PurpleAir and AQS Measurements
    6. Creating a Model to Correct PurpleAir Measurements
    7. Summary
  17. IV. Other Data Sources
  18. 13. Working with Text
    1. Examples of Text and Tasks
      1. Convert Text into a Standard Format
      2. Extract a Piece of Text to Create a Feature
      3. Transform Text into Features
      4. Text Analysis
    2. String Manipulation
      1. Converting Text to a Standard Format with Python String Methods
      2. String Methods in pandas
      3. Splitting Strings to Extract Pieces of Text
    3. Regular Expressions
      1. Concatenation of Literals
      2. Quantifiers
      3. Alternation and Grouping to Create Features
      4. Reference Tables
    4. Text Analysis
    5. Summary
  19. 14. Data Exchange
    1. NetCDF Data
    2. JSON Data
    3. HTTP
    4. REST
    5. XML, HTML, and XPath
      1. Example: Scraping Race Times from Wikipedia
      2. XPath
      3. Example: Accessing Exchange Rates from the ECB
    6. Summary
  20. V. Linear Modeling
  21. 15. Linear Models
    1. Simple Linear Model
    2. Example: A Simple Linear Model for Air Quality
      1. Interpreting Linear Models
      2. Assessing the Fit
    3. Fitting the Simple Linear Model
    4. Multiple Linear Model
    5. Fitting the Multiple Linear Model
    6. Example: Where Is the Land of Opportunity?
      1. Explaining Upward Mobility Using Commute Time
      2. Relating Upward Mobility Using Multiple Variables
    7. Feature Engineering for Numeric Measurements
    8. Feature Engineering for Categorical Measurements
    9. Summary
  22. 16. Model Selection
    1. Overfitting
      1. Example: Energy Consumption
    2. Train-Test Split
    3. Cross-Validation
    4. Regularization
    5. Model Bias and Variance
    6. Summary
  23. 17. Theory for Inference and Prediction
    1. Distributions: Population, Empirical, Sampling
    2. Basics of Hypothesis Testing
      1. Example: A Rank Test to Compare Productivity of Wikipedia Contributors
      2. Example: A Test of Proportions for Vaccine Efficacy
    3. Bootstrapping for Inference
    4. Basics of Confidence Intervals
    5. Basics of Prediction Intervals
      1. Example: Predicting Bus Lateness
      2. Example: Predicting Crab Size
      3. Example: Predicting the Incremental Growth of a Crab
    6. Probability for Inference and Prediction
      1. Formalizing the Theory for Average Rank Statistics
      2. General Properties of Random Variables
      3. Probability Behind Testing and Intervals
      4. Probability Behind Model Selection
    7. Summary
  24. 18. Case Study: How to Weigh a Donkey
    1. Donkey Study Question and Scope
    2. Wrangling and Transforming
    3. Exploring
    4. Modeling a Donkey’s Weight
      1. A Loss Function for Prescribing Anesthetics
      2. Fitting a Simple Linear Model
      3. Fitting a Multiple Linear Model
      4. Bringing Qualitative Features into the Model
      5. Model Assessment
    5. Summary
  25. VI. Classification
  26. 19. Classification
    1. Example: Wind-Damaged Trees
    2. Modeling and Classification
      1. A Constant Model
      2. Examining the Relationship Between Size and Windthrow
    3. Modeling Proportions (and Probabilities)
      1. A Logistic Model
      2. Log Odds
      3. Using a Logistic Curve
    4. A Loss Function for the Logistic Model
    5. From Probabilities to Classification
      1. The Confusion Matrix
      2. Precision Versus Recall
    6. Summary
  27. 20. Numerical Optimization
    1. Gradient Descent Basics
    2. Minimizing Huber Loss
    3. Convex and Differentiable Loss Functions
    4. Variants of Gradient Descent
      1. Stochastic Gradient Descent
      2. Mini-Batch Gradient Descent
      3. Newton’s Method
    5. Summary
  28. 21. Case Study: Detecting Fake News
    1. Question and Scope
    2. Obtaining and Wrangling the Data
    3. Exploring the Data
      1. Exploring the Publishers
      2. Exploring Publication Date
      3. Exploring Words in Articles
    4. Modeling
      1. A Single-Word Model
      2. Multiple-Word Model
      3. Predicting with the tf-idf Transform
    5. Summary
  29. Additional Material
  30. Data Sources
  31. Index
  32. About the Authors

Product information

  • Title: Learning Data Science
  • Authors: Sam Lau, Joseph Gonzalez, Deborah Nolan
  • Release date: September 2023
  • Publisher: O'Reilly Media, Inc.
  • ISBN: 9781098113001