Doing Data Science

by Cathy O'Neil, Rachel Schutt

October 2013

Beginner

405 pages

10h 9m

English

O'Reilly Media, Inc.

Content preview from Doing Data Science

Chapter 11. Causality

Many of the models and examples in the book so far have focused on the fundamental problem of prediction. In Chapter 8, for example, your goal was to build a model to predict whether a person would be likely to prefer a certain item, such as a movie or a book. Thousands of features may go into such a model, and you may use feature selection to narrow them down, but ultimately the model is optimized for the highest accuracy. When you optimize for accuracy, you don’t necessarily worry about the meaning or interpretation of the features, and when there are thousands of them, it’s well-nigh impossible to interpret them at all.

Additionally, you wouldn’t even want to claim that certain characteristics caused the person to buy the item. For example, your model for predicting or recommending a book on Amazon could include a feature “whether or not you’ve read Wes McKinney’s O’Reilly book Python for Data Analysis.” We wouldn’t say that reading his book caused you to read this book. It just might be a good predictor, one that was discovered in the process of optimizing for accuracy. We wish to emphasize here that it’s not simply the familiar correlation-causation distinction you’ve perhaps had drilled into your head already, but rather that your intent when building such a model or system was not even to understand causality at all, but ...
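
The idea that a feature can be a good predictor without being a cause can be sketched with a small simulation. This is only an illustration under made-up assumptions, not anything taken from the book: the function names (`simulate`, `rate_b`), the hidden "interested in data science" trait, and all of the probabilities are hypothetical. Here a hidden interest drives both reading book A and reading book B, so A predicts B in observational data, yet forcing everyone to read (or not read) A leaves the rate of reading B essentially unchanged.

```python
import random


def simulate(n, intervene=None, seed=0):
    """Return (read_a, read_b) pairs for n hypothetical shoppers.

    intervene=None        observational data: interest drives reading A
    intervene=True/False  everyone is forced to have read / not read A
    All probabilities below are invented for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        interested = rng.random() < 0.3  # hidden common cause
        natural_a = rng.random() < (0.6 if interested else 0.05)
        read_a = natural_a if intervene is None else intervene
        # Reading B depends only on interest, never on read_a.
        read_b = rng.random() < (0.5 if interested else 0.02)
        rows.append((read_a, read_b))
    return rows


def rate_b(rows, a_value):
    """Fraction of shoppers who read B among those with read_a == a_value."""
    subset = [b for a, b in rows if a == a_value]
    return sum(subset) / len(subset)


# Observational data: A looks like a strong predictor of B.
obs = simulate(20000)
observed_gap = rate_b(obs, True) - rate_b(obs, False)

# Interventions: assign A regardless of interest, then compare B rates.
forced_yes = simulate(20000, intervene=True, seed=1)
forced_no = simulate(20000, intervene=False, seed=2)
causal_gap = rate_b(forced_yes, True) - rate_b(forced_no, False)

print(f"observed gap in P(read B): {observed_gap:.3f}")
print(f"gap under intervention:    {causal_gap:.3f}")
```

In this toy setup, a model optimized for accuracy would happily use "read A" as a feature, because the observed gap is large. The gap under intervention is close to zero, which is the sense in which the feature predicts without causing.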