Data Science: The Hard Parts

Book description

This practical guide provides a collection of techniques and best practices that are often overlooked in data engineering and data science pedagogy. A common misconception is that great data scientists are experts in the "big themes" of the discipline: machine learning and programming. But most of the time, these tools can only take us so far. In practice, it is the smaller tools and skills that separate a great data scientist from a not-so-great one.

Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and a qualified data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries.

With this book, you will:

  • Understand how data science creates value
  • Deliver compelling narratives to sell your data science project
  • Build a business case using unit economics principles
  • Create new features for an ML model using storytelling
  • Learn how to decompose KPIs
  • Perform growth decompositions to find root causes for changes in a metric

Daniel Vaughan is head of data at Clip, the leading paytech company in Mexico. He's the author of Analytical Skills for AI and Data Science (O'Reilly).

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  2. I. Data Analytics Techniques
  3. 1. So What? Creating Value with Data Science
    1. What Is Value?
    2. What: Understanding the Business
    3. So What: The Gist of Value Creation in DS
    4. Now What: Be a Go-Getter
    5. Measuring Value
    6. Key Takeaways
    7. Further Reading
  4. 2. Metrics Design
    1. Desirable Properties That Metrics Should Have
      1. Measurable
      2. Actionable
      3. Relevance
      4. Timeliness
    2. Metrics Decomposition
      1. Funnel Analytics
      2. Stock-Flow Decompositions
      3. P×Q-Type Decompositions
    3. Example: Another Revenue Decomposition
    4. Example: Marketplaces
    5. Key Takeaways
    6. Further Reading
  5. 3. Growth Decompositions: Understanding Tailwinds and Headwinds
    1. Why Growth Decompositions?
    2. Additive Decomposition
      1. Example
      2. Interpretation and Use Cases
    3. Multiplicative Decomposition
      1. Example
      2. Interpretation
    4. Mix-Rate Decompositions
      1. Example
      2. Interpretation
    5. Mathematical Derivations
      1. Additive Decomposition
      2. Multiplicative Decomposition
      3. Mix-Rate Decomposition
    6. Key Takeaways
    7. Further Reading
  6. 4. 2×2 Designs
    1. The Case for Simplification
    2. What’s a 2×2 Design?
    3. Example: Test a Model and a New Feature
    4. Example: Understanding User Behavior
    5. Example: Credit Origination and Acceptance
    6. Example: Prioritizing Your Workflow
    7. Key Takeaways
    8. Further Reading
  7. 5. Building Business Cases
    1. Some Principles to Construct Business Cases
    2. Example: Proactive Retention Strategy
    3. Fraud Prevention
    4. Purchasing External Datasets
    5. Working on a Data Science Project
    6. Key Takeaways
    7. Further Reading
  8. 6. What’s in a Lift?
    1. Lifts Defined
    2. Example: Classifier Model
    3. Self-Selection and Survivorship Biases
    4. Other Use Cases for Lifts
    5. Key Takeaways
    6. Further Reading
  9. 7. Narratives
    1. What’s in a Narrative: Telling a Story with Your Data
      1. Clear and to the Point
      2. Credible
      3. Memorable
      4. Actionable
    2. Building a Narrative
      1. Science as Storytelling
      2. What, So What, and Now What?
    3. The Last Mile
      1. Writing TL;DRs
      2. Tips to Write Memorable TL;DRs
      3. Example: Writing a TL;DR for This Chapter
      4. Delivering Powerful Elevator Pitches
      5. Presenting Your Narrative
    4. Key Takeaways
    5. Further Reading
  10. 8. Datavis: Choosing the Right Plot to Deliver a Message
    1. Some Useful and Not-So-Used Data Visualizations
      1. Bar Versus Line Plots
      2. Slopegraphs
      3. Waterfall Charts
      4. Scatterplot Smoothers
      5. Plotting Distributions
    2. General Recommendations
      1. Find the Right Datavis for Your Message
      2. Choose Your Colors Wisely
      3. Different Dimensions in a Plot
      4. Aim for a Large Enough Data-Ink Ratio
      5. Customization Versus Semiautomation
      6. Get the Font Size Right from the Beginning
      7. Interactive or Not
      8. Stay Simple
      9. Start by Explaining the Plot
    3. Key Takeaways
    4. Further Reading
  11. II. Machine Learning
  12. 9. Simulation and Bootstrapping
    1. Basics of Simulation
    2. Simulating a Linear Model and Linear Regression
    3. What Are Partial Dependence Plots?
    4. Omitted Variable Bias
    5. Simulating Classification Problems
      1. Latent Variable Models
      2. Comparing Different Algorithms
    6. Bootstrapping
    7. Key Takeaways
    8. Further Reading
  13. 10. Linear Regression: Going Back to Basics
    1. What’s in a Coefficient?
    2. The Frisch-Waugh-Lovell Theorem
    3. Why Should You Care About FWL?
    4. Confounders
    5. Additional Variables
    6. The Central Role of Variance in ML
    7. Key Takeaways
    8. Further Reading
  14. 11. Data Leakage
    1. What Is Data Leakage?
      1. Outcome Is Also a Feature
      2. A Function of the Outcome Is Itself a Feature
      3. Bad Controls
      4. Mislabeling of a Timestamp
      5. Multiple Datasets with Sloppy Time Aggregations
      6. Leakage of Other Information
    2. Detecting Data Leakage
    3. Complete Separation
    4. Windowing Methodology
      1. Choosing the Length of the Windows
      2. The Training Stage Mirrors the Scoring Stage
      3. Implementing the Windowing Methodology
    5. I Have Leakage: Now What?
    6. Key Takeaways
    7. Further Reading
  15. 12. Productionizing Models
    1. What Does “Production Ready” Mean?
      1. Batch Scores (Offline)
      2. Real-Time Model Objects
    2. Data and Model Drift
    3. Essential Steps in any Production Pipeline
      1. Get and Transform Data
      2. Validate Data
      3. Training and Scoring Stages
      4. Validate Model and Scores
      5. Deploy Model and Scores
    4. Key Takeaways
    5. Further Reading
  16. 13. Storytelling in Machine Learning
    1. A Holistic View of Storytelling in ML
    2. Ex Ante and Interim Storytelling
      1. Creating Hypotheses
      2. Feature Engineering
    3. Ex Post Storytelling: Opening the Black Box
      1. Interpretability-Performance Trade-Off
      2. Linear Regression: Setting a Benchmark
      3. Feature Importance
      4. Heatmaps
      5. Partial Dependence Plots
      6. Accumulated Local Effects
    4. Key Takeaways
    5. Further Reading
  17. 14. From Prediction to Decisions
    1. Dissecting Decision Making
    2. Simple Decision Rules by Smart Thresholding
      1. Precision and Recall
      2. Example: Lead Generation
    3. Confusion Matrix Optimization
    4. Key Takeaways
    5. Further Reading
  18. 15. Incrementality: The Holy Grail of Data Science?
    1. Defining Incrementality
      1. Causal Reasoning to Improve Prediction
      2. Causal Reasoning as a Differentiator
      3. Improved Decision Making
    2. Confounders and Colliders
    3. Selection Bias
    4. Unconfoundedness Assumption
    5. Breaking Selection Bias: Randomization
    6. Matching
    7. Machine Learning and Causal Inference
      1. Open Source Codebases
      2. Double Machine Learning
    8. Key Takeaways
    9. Further Reading
  19. 16. A/B Tests
    1. What Is an A/B Test?
    2. Decision Criterion
    3. Minimum Detectable Effects
      1. Choosing the Statistical Power, Level, and P
      2. Estimating the Variance of the Outcome
      3. Simulations
      4. Example: Conversion Rates
      5. Setting the MDE
    4. Hypotheses Backlog
      1. Metric
      2. Hypothesis
      3. Ranking
    5. Governance of Experiments
    6. Key Takeaways
    7. Further Reading
  20. 17. Large Language Models and the Practice of Data Science
    1. The Current State of AI
    2. What Do Data Scientists Do?
    3. Evolving the Data Scientist’s Job Description
      1. Case Study: A/B Testing
      2. Case Study: Data Cleansing
      3. Case Study: Machine Learning
    4. LLMs and This Book
    5. Key Takeaways
    6. Further Reading
  21. Index
  22. About the Author

Product information

  • Title: Data Science: The Hard Parts
  • Author(s): Daniel Vaughan
  • Release date: November 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098146474