Data Science: The Hard Parts

Name: Data Science: The Hard Parts
Author: Daniel Vaughan
ISBN: 9781098146474

by Daniel Vaughan

November 2023

Beginner to intermediate

254 pages

6h 43m

English

O'Reilly Media, Inc.

Preface
Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
I. Data Analytics Techniques
1. So What? Creating Value with Data Science
What Is Value? | What: Understanding the Business | So What: The Gist of Value Creation in DS | Now What: Be a Go-Getter | Measuring Value | Key Takeaways | Further Reading
2. Metrics Design
Desirable Properties That Metrics Should Have | Measurable | Actionable | Relevance | Timeliness | Metrics Decomposition | Funnel Analytics | Stock-Flow Decompositions | P×Q-Type Decompositions | Example: Another Revenue Decomposition | Example: Marketplaces | Key Takeaways | Further Reading
3. Growth Decompositions: Understanding Tailwinds and Headwinds
Why Growth Decompositions? | Additive Decomposition | Example | Interpretation and Use Cases | Multiplicative Decomposition | Example | Interpretation | Mix-Rate Decompositions | Example | Interpretation | Mathematical Derivations | Additive Decomposition | Multiplicative Decomposition | Mix-Rate Decomposition | Key Takeaways | Further Reading
4. 2×2 Designs
The Case for Simplification | What’s a 2×2 Design? | Example: Test a Model and a New Feature | Example: Understanding User Behavior | Example: Credit Origination and Acceptance | Example: Prioritizing Your Workflow | Key Takeaways | Further Reading
5. Building Business Cases
Some Principles to Construct Business Cases | Example: Proactive Retention Strategy | Fraud Prevention | Purchasing External Datasets | Working on a Data Science Project | Key Takeaways | Further Reading
6. What’s in a Lift?
Lifts Defined | Example: Classifier Model | Self-Selection and Survivorship Biases | Other Use Cases for Lifts | Key Takeaways | Further Reading
7. Narratives
What’s in a Narrative: Telling a Story with Your Data | Clear and to the Point | Credible | Memorable | Actionable | Building a Narrative | Science as Storytelling | What, So What, and Now What? | The Last Mile | Writing TL;DRs | Tips to Write Memorable TL;DRs | Example: Writing a TL;DR for This Chapter | Delivering Powerful Elevator Pitches | Presenting Your Narrative | Key Takeaways | Further Reading
8. Datavis: Choosing the Right Plot to Deliver a Message
Some Useful and Not-So-Used Data Visualizations | Bar Versus Line Plots | Slopegraphs | Waterfall Charts | Scatterplot Smoothers | Plotting Distributions | General Recommendations | Find the Right Datavis for Your Message | Choose Your Colors Wisely | Different Dimensions in a Plot | Aim for a Large Enough Data-Ink Ratio | Customization Versus Semiautomation | Get the Font Size Right from the Beginning | Interactive or Not | Stay Simple | Start by Explaining the Plot | Key Takeaways | Further Reading

II. Machine Learning
9. Simulation and Bootstrapping
Basics of Simulation | Simulating a Linear Model and Linear Regression | What Are Partial Dependence Plots? | Omitted Variable Bias | Simulating Classification Problems | Latent Variable Models | Comparing Different Algorithms | Bootstrapping | Key Takeaways | Further Reading
10. Linear Regression: Going Back to Basics
What’s in a Coefficient? | The Frisch-Waugh-Lovell Theorem | Why Should You Care About FWL? | Confounders | Additional Variables | The Central Role of Variance in ML | Key Takeaways | Further Reading
11. Data Leakage
What Is Data Leakage? | Outcome Is Also a Feature | A Function of the Outcome Is Itself a Feature | Bad Controls | Mislabeling of a Timestamp | Multiple Datasets with Sloppy Time Aggregations | Leakage of Other Information | Detecting Data Leakage | Complete Separation | Windowing Methodology | Choosing the Length of the Windows | The Training Stage Mirrors the Scoring Stage | Implementing the Windowing Methodology | I Have Leakage: Now What? | Key Takeaways | Further Reading
12. Productionizing Models
What Does “Production Ready” Mean? | Batch Scores (Offline) | Real-Time Model Objects | Data and Model Drift | Essential Steps in any Production Pipeline | Get and Transform Data | Validate Data | Training and Scoring Stages | Validate Model and Scores | Deploy Model and Scores | Key Takeaways | Further Reading
13. Storytelling in Machine Learning
A Holistic View of Storytelling in ML | Ex Ante and Interim Storytelling | Creating Hypotheses | Feature Engineering | Ex Post Storytelling: Opening the Black Box | Interpretability-Performance Trade-Off | Linear Regression: Setting a Benchmark | Feature Importance | Heatmaps | Partial Dependence Plots | Accumulated Local Effects | Key Takeaways | Further Reading
14. From Prediction to Decisions
Dissecting Decision Making | Simple Decision Rules by Smart Thresholding | Precision and Recall | Example: Lead Generation | Confusion Matrix Optimization | Key Takeaways | Further Reading
15. Incrementality: The Holy Grail of Data Science?
Defining Incrementality | Causal Reasoning to Improve Prediction | Causal Reasoning as a Differentiator | Improved Decision Making | Confounders and Colliders | Selection Bias | Unconfoundedness Assumption | Breaking Selection Bias: Randomization | Matching | Machine Learning and Causal Inference | Open Source Codebases | Double Machine Learning | Key Takeaways | Further Reading
16. A/B Tests
What Is an A/B Test? | Decision Criterion | Minimum Detectable Effects | Choosing the Statistical Power, Level, and P | Estimating the Variance of the Outcome | Simulations | Example: Conversion Rates | Setting the MDE | Hypotheses Backlog | Metric | Hypothesis | Ranking | Governance of Experiments | Key Takeaways | Further Reading
17. Large Language Models and the Practice of Data Science
The Current State of AI | What Do Data Scientists Do? | Evolving the Data Scientist’s Job Description | Case Study: A/B Testing | Case Study: Data Cleansing | Case Study: Machine Learning | LLMs and This Book | Key Takeaways | Further Reading
Index
About the Author

Content preview from Data Science: The Hard Parts

Chapter 10. Linear Regression: Going Back to Basics

Linear regression, estimated by ordinary least squares (OLS), is the first machine learning algorithm most data scientists learn, but it has become more of an intellectual curiosity with the advent of more powerful nonlinear alternatives, like gradient boosting regression. As a result, many practitioners are unaware of several properties of OLS that help build intuition about learning algorithms. This chapter goes through some of these properties and highlights their significance.

What’s in a Coefficient?

Let’s start with the simplest setting with only one feature:

$$y = \alpha_{0} + \alpha_{1} x_{1} + \epsilon$$

The first parameter is the constant or intercept, and the second parameter is the slope, as you may recall from the typical functional form for a line.
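To make the roles of the intercept and slope concrete, here is a minimal sketch that simulates data from the one-feature model and fits a least-squares line. The parameter values (2.0 and 3.0), the noise scale, and the helper name `simulate_linear_model` are assumptions chosen purely for illustration; they do not come from the book.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for illustration.
ALPHA_0 = 2.0  # intercept
ALPHA_1 = 3.0  # slope


def simulate_linear_model(n_obs, seed=0):
    """Draw (x1, y) from y = ALPHA_0 + ALPHA_1 * x1 + eps, with mean-zero noise."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(loc=1.0, scale=2.0, size=n_obs)
    eps = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    y = ALPHA_0 + ALPHA_1 * x1 + eps
    return x1, y


x1, y = simulate_linear_model(n_obs=10_000)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope_hat, intercept_hat = np.polyfit(x1, y, deg=1)
print(f"estimated intercept: {intercept_hat:.2f}, estimated slope: {slope_hat:.2f}")
```

With a sample this large, the estimates should land close to the values used in the simulation, though never exactly on them.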

Since the residuals have mean zero, taking expectations and then partial derivatives shows that:

$$\begin{aligned} \alpha_{1} &= \frac{\partial E(y)}{\partial x_{1}} \\ \alpha_{0} &= E(y) - \alpha_{1} E(x_{1}) \end{aligned}$$
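The two results can be spelled out in a couple of steps. The sketch below reads the expectation in the slope equation as conditional on $x_{1}$ and assumes $E(\epsilon \mid x_{1}) = 0$, which is slightly stronger than a mean-zero residual; both readings are assumptions made here for the derivation, not statements taken from the text.

```latex
\begin{aligned}
E(y \mid x_{1}) &= \alpha_{0} + \alpha_{1} x_{1} + E(\epsilon \mid x_{1})
                 = \alpha_{0} + \alpha_{1} x_{1}
  &&\Rightarrow\quad \frac{\partial E(y \mid x_{1})}{\partial x_{1}} = \alpha_{1} \\
E(y) &= \alpha_{0} + \alpha_{1} E(x_{1}) + E(\epsilon)
      = \alpha_{0} + \alpha_{1} E(x_{1})
  &&\Rightarrow\quad \alpha_{0} = E(y) - \alpha_{1} E(x_{1})
\end{aligned}
```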

As discussed in Chapter 9, the first equation is quite useful for interpretability reasons, since it says that a one-unit change in the feature is associated with a change of $\alpha_{1}$ units in the outcome, on average. However, as I will now show, you must be careful not to give it a causal interpretation.

By substituting the definition of the outcome inside the covariance, you can also show that:

$$\alpha_{1} = \frac{\mathrm{Cov}(y, x_{1})}{\mathrm{Var}(x_{1})}$$
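As a quick numerical check, the sketch below computes the slope as the sample covariance over the sample variance, recovers the intercept from the sample means, and compares both with a standard least-squares fit. The simulated parameters (1.0 and 0.5), the seed, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 5_000

# Hypothetical data-generating process, used only for this check.
x1 = rng.normal(size=n_obs)
y = 1.0 + 0.5 * x1 + rng.normal(size=n_obs)

# Slope as Cov(y, x1) / Var(x1); ddof=1 keeps both on the same normalization.
cov_yx = np.cov(y, x1, ddof=1)[0, 1]
var_x = np.var(x1, ddof=1)
slope_cov = cov_yx / var_x

# Intercept from the sample means: mean(y) - slope * mean(x1).
intercept_means = y.mean() - slope_cov * x1.mean()

# Reference fit: least squares on the design matrix [1, x1].
X = np.column_stack([np.ones(n_obs), x1])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept_ols, slope_ols = coef

print(f"slope:     cov/var = {slope_cov:.4f}, lstsq = {slope_ols:.4f}")
print(f"intercept: means   = {intercept_means:.4f}, lstsq = {intercept_ols:.4f}")
```

The two routes should agree up to floating-point error, since the covariance-over-variance expression is the closed-form least-squares solution in the one-feature case.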

In a bivariate setting, the slope depends on the covariance between the outcome and the feature, and ...

Publisher Resources

ISBN: 9781098146467
