
Data Science: The Hard Parts

Name: Data Science: The Hard Parts
Author: Daniel Vaughan
ISBN: 9781098146474

by Daniel Vaughan

November 2023

Beginner to intermediate

254 pages

6h 43m

English

O'Reilly Media, Inc.


Preface
Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
I. Data Analytics Techniques
1. So What? Creating Value with Data Science
What Is Value? | What: Understanding the Business | So What: The Gist of Value Creation in DS | Now What: Be a Go-Getter | Measuring Value | Key Takeaways | Further Reading
2. Metrics Design
Desirable Properties That Metrics Should Have | Measurable | Actionable | Relevance | Timeliness | Metrics Decomposition | Funnel Analytics | Stock-Flow Decompositions | P×Q-Type Decompositions | Example: Another Revenue Decomposition | Example: Marketplaces | Key Takeaways | Further Reading
3. Growth Decompositions: Understanding Tailwinds and Headwinds
Why Growth Decompositions? | Additive Decomposition | Example | Interpretation and Use Cases | Multiplicative Decomposition | Example | Interpretation | Mix-Rate Decompositions | Example | Interpretation | Mathematical Derivations | Additive Decomposition | Multiplicative Decomposition | Mix-Rate Decomposition | Key Takeaways | Further Reading
4. 2×2 Designs
The Case for Simplification | What’s a 2×2 Design? | Example: Test a Model and a New Feature | Example: Understanding User Behavior | Example: Credit Origination and Acceptance | Example: Prioritizing Your Workflow | Key Takeaways | Further Reading
5. Building Business Cases
Some Principles to Construct Business Cases | Example: Proactive Retention Strategy | Fraud Prevention | Purchasing External Datasets | Working on a Data Science Project | Key Takeaways | Further Reading
6. What’s in a Lift?
Lifts Defined | Example: Classifier Model | Self-Selection and Survivorship Biases | Other Use Cases for Lifts | Key Takeaways | Further Reading
7. Narratives
What’s in a Narrative: Telling a Story with Your Data | Clear and to the Point | Credible | Memorable | Actionable | Building a Narrative | Science as Storytelling | What, So What, and Now What? | The Last Mile | Writing TL;DRs | Tips to Write Memorable TL;DRs | Example: Writing a TL;DR for This Chapter | Delivering Powerful Elevator Pitches | Presenting Your Narrative | Key Takeaways | Further Reading
8. Datavis: Choosing the Right Plot to Deliver a Message
Some Useful and Not-So-Used Data Visualizations | Bar Versus Line Plots | Slopegraphs | Waterfall Charts | Scatterplot Smoothers | Plotting Distributions | General Recommendations | Find the Right Datavis for Your Message | Choose Your Colors Wisely | Different Dimensions in a Plot | Aim for a Large Enough Data-Ink Ratio | Customization Versus Semiautomation | Get the Font Size Right from the Beginning | Interactive or Not | Stay Simple | Start by Explaining the Plot | Key Takeaways | Further Reading

II. Machine Learning
9. Simulation and Bootstrapping
Basics of Simulation | Simulating a Linear Model and Linear Regression | What Are Partial Dependence Plots? | Omitted Variable Bias | Simulating Classification Problems | Latent Variable Models | Comparing Different Algorithms | Bootstrapping | Key Takeaways | Further Reading
10. Linear Regression: Going Back to Basics
What’s in a Coefficient? | The Frisch-Waugh-Lovell Theorem | Why Should You Care About FWL? | Confounders | Additional Variables | The Central Role of Variance in ML | Key Takeaways | Further Reading
11. Data Leakage
What Is Data Leakage? | Outcome Is Also a Feature | A Function of the Outcome Is Itself a Feature | Bad Controls | Mislabeling of a Timestamp | Multiple Datasets with Sloppy Time Aggregations | Leakage of Other Information | Detecting Data Leakage | Complete Separation | Windowing Methodology | Choosing the Length of the Windows | The Training Stage Mirrors the Scoring Stage | Implementing the Windowing Methodology | I Have Leakage: Now What? | Key Takeaways | Further Reading
12. Productionizing Models
What Does “Production Ready” Mean? | Batch Scores (Offline) | Real-Time Model Objects | Data and Model Drift | Essential Steps in any Production Pipeline | Get and Transform Data | Validate Data | Training and Scoring Stages | Validate Model and Scores | Deploy Model and Scores | Key Takeaways | Further Reading
13. Storytelling in Machine Learning
A Holistic View of Storytelling in ML | Ex Ante and Interim Storytelling | Creating Hypotheses | Feature Engineering | Ex Post Storytelling: Opening the Black Box | Interpretability-Performance Trade-Off | Linear Regression: Setting a Benchmark | Feature Importance | Heatmaps | Partial Dependence Plots | Accumulated Local Effects | Key Takeaways | Further Reading
14. From Prediction to Decisions
Dissecting Decision Making | Simple Decision Rules by Smart Thresholding | Precision and Recall | Example: Lead Generation | Confusion Matrix Optimization | Key Takeaways | Further Reading
15. Incrementality: The Holy Grail of Data Science?
Defining Incrementality | Causal Reasoning to Improve Prediction | Causal Reasoning as a Differentiator | Improved Decision Making | Confounders and Colliders | Selection Bias | Unconfoundedness Assumption | Breaking Selection Bias: Randomization | Matching | Machine Learning and Causal Inference | Open Source Codebases | Double Machine Learning | Key Takeaways | Further Reading
16. A/B Tests
What Is an A/B Test? | Decision Criterion | Minimum Detectable Effects | Choosing the Statistical Power, Level, and P | Estimating the Variance of the Outcome | Simulations | Example: Conversion Rates | Setting the MDE | Hypotheses Backlog | Metric | Hypothesis | Ranking | Governance of Experiments | Key Takeaways | Further Reading
17. Large Language Models and the Practice of Data Science
The Current State of AI | What Do Data Scientists Do? | Evolving the Data Scientist’s Job Description | Case Study: A/B Testing | Case Study: Data Cleansing | Case Study: Machine Learning | LLMs and This Book | Key Takeaways | Further Reading
Index
About the Author

Content preview from Data Science: The Hard Parts

Chapter 6. What’s in a Lift?

There are very simple techniques that help you accomplish many different tasks. Lifts are one of those tools. Unfortunately, many data scientists don’t understand lifts or haven’t seen their usefulness. This short chapter will help you master them.

Lifts Defined

Generally speaking, a lift is the ratio of an aggregate metric for one group to the same aggregate for another group. The most common aggregation method is taking averages, as these are the natural sample estimates for expected values. You’ll see some examples in this chapter.

$$\text{Lift}(\text{metric}, A, B) = \frac{\text{Metric aggregate for group } A}{\text{Metric aggregate for group } B}$$
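The definition can be sketched in a few lines of Python. This is only an illustration, assuming the aggregate is the sample mean by default; the function name `lift` and the spend figures are hypothetical and do not come from the book.

```python
from statistics import mean
from typing import Callable, Sequence


def lift(
    group_a: Sequence[float],
    group_b: Sequence[float],
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Ratio of the aggregate metric for group A to that for group B."""
    denominator = aggregate(group_b)
    if denominator == 0:
        raise ValueError("Lift is undefined when the baseline aggregate is zero.")
    return aggregate(group_a) / denominator


# Hypothetical per-user spend for two groups (made-up numbers).
spend_a = [12.0, 18.0, 30.0]  # mean is 20.0
spend_b = [10.0, 10.0, 10.0]  # mean is 10.0

print(lift(spend_a, spend_b))  # 2.0
```

Passing a different `aggregate` (for example `statistics.median`) changes the metric while the ratio keeps the same form.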

In the more classical data mining literature, the aggregate is a frequency or probability, and group A is a subset of group B, which is usually the population under study. The objective here is to measure the performance of a selection algorithm (for example, clustering or a classifier) relative to the population average.
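The classical setting might be sketched as follows, assuming binary outcomes and a boolean flag marking which units the selection algorithm picked. The helper `selection_lift` and the ten-unit dataset are hypothetical, made up for illustration.

```python
from typing import Sequence


def selection_lift(outcomes: Sequence[int], selected: Sequence[bool]) -> float:
    """Positive rate among selected units (group A) over the population rate (group B)."""
    if len(outcomes) != len(selected):
        raise ValueError("outcomes and selected must have the same length.")
    population_rate = sum(outcomes) / len(outcomes)
    chosen = [y for y, s in zip(outcomes, selected) if s]
    if not chosen or population_rate == 0:
        raise ValueError("Lift is undefined without selected units or positives.")
    return (sum(chosen) / len(chosen)) / population_rate


# Hypothetical population of 10 units; 1 marks a positive outcome.
outcomes = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # population rate: 3/10
# Suppose a classifier selects the first four units: 2 of 4 are positive.
selected = [True, True, True, True, False, False, False, False, False, False]

print(round(selection_lift(outcomes, selected), 2))  # 1.67
```

Here group A (the selected units) is a subset of group B (the whole population), so a value above one means the selection finds positives more often than picking at random.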

Consider the lift of having women as CEOs in the US. Under a random selection baseline, there should be roughly 50% female CEOs. One study estimates this number at 32%. The lift of the current job market selection mechanism is 0.32/0.5 = 0.64, so women are underrepresented relative to the baseline population frequency.

As the name suggests, the lift measures how much the aggregate in one group increases or decreases relative to the baseline. A ratio larger or smaller than one is known as uplift or downlift, respectively. If there’s no lift, the ...
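The CEO example and the uplift/downlift vocabulary can be reproduced with a short sketch. The helper `describe_lift` is hypothetical, and labeling a ratio of approximately one as "no lift" is an assumption of this sketch rather than something stated in the preview text.

```python
def describe_lift(ratio: float, tol: float = 1e-9) -> str:
    """Label a lift ratio as uplift (>1), downlift (<1), or no lift (about 1)."""
    if ratio > 1 + tol:
        return "uplift"
    if ratio < 1 - tol:
        return "downlift"
    return "no lift"  # assumption: a ratio of one means the groups do not differ


female_ceo_share = 0.32  # estimate quoted in the text
baseline_share = 0.50    # random-selection baseline from the text

ceo_lift = female_ceo_share / baseline_share
print(round(ceo_lift, 2), describe_lift(ceo_lift))  # 0.64 downlift
```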


Publisher Resources

ISBN: 9781098146467
