Doing Data Science

by Cathy O'Neil, Rachel Schutt

October 2013

Beginner

405 pages

10h 9m

English

O'Reilly Media, Inc.

Preface
Motivation; Origins of the Class; Origins of the Book; What to Expect from This Book; How This Book Is Organized; How to Read This Book; How Code Is Used in This Book; Who This Book Is For; Prerequisites; Supplemental Reading; About the Contributors; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction: What Is Data Science?
Big Data and Data Science Hype; Getting Past the Hype; Why Now?; Datafication; The Current Landscape (with a Little History); Data Science Jobs; A Data Science Profile; Thought Experiment: Meta-Definition; OK, So What Is a Data Scientist, Really?; In Academia; In Industry
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data; Statistical Inference; Populations and Samples; Populations and Samples of Big Data; Big Data Can Mean Big Assumptions; Modeling; Exploratory Data Analysis; Philosophy of Exploratory Data Analysis; Exercise: EDA; The Data Science Process; A Data Scientist’s Role in This Process; Thought Experiment: How Would You Simulate Chaos?; Case Study: RealDirect; How Does RealDirect Make Money?; Exercise: RealDirect Data Strategy
3. Algorithms
Machine Learning Algorithms; Three Basic Algorithms; Linear Regression; k-Nearest Neighbors (k-NN); k-means; Exercise: Basic Machine Learning Algorithms; Solutions; Summing It All Up; Thought Experiment: Automated Statistician
4. Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example; Why Won’t Linear Regression Work for Filtering Spam?; How About k-nearest Neighbors?; Naive Bayes; Bayes Law; A Spam Filter for Individual Words; A Spam Filter That Combines Words: Naive Bayes; Fancy It Up: Laplace Smoothing; Comparing Naive Bayes to k-NN; Sample Code in bash; Scraping the Web: APIs and Other Tools; Jake’s Exercise: Naive Bayes for Article Classification; Sample R Code for Dealing with the NYT API
5. Logistic Regression
Thought Experiments; Classifiers; Runtime; You; Interpretability; Scalability; M6D Logistic Regression Case Study; Click Models; The Underlying Math; Estimating α and β; Newton’s Method; Stochastic Gradient Descent; Implementation; Evaluation; Media 6 Degrees Exercise; Sample R Code
6. Time Stamps and Financial Modeling
Kyle Teague and GetGlue; Timestamps; Exploratory Data Analysis (EDA); Metrics and New Variables or Features; What’s Next?; Cathy O’Neil; Thought Experiment; Financial Modeling; In-Sample, Out-of-Sample, and Causality; Preparing Financial Data; Log Returns; Example: The S&P Index; Working out a Volatility Measurement; Exponential Downweighting; The Financial Modeling Feedback Loop; Why Regression?; Adding Priors; A Baby Model; Exercise: GetGlue and Timestamped Event Data; Exercise: Financial Data
7. Extracting Meaning from Data
William Cukierski; Background: Data Science Competitions; Background: Crowdsourcing; The Kaggle Model; A Single Contestant; Their Customers; Thought Experiment: What Are the Ethical Implications of a Robo-Grader?; Feature Selection; Example: User Retention; Filters; Wrappers; Embedded Methods: Decision Trees; Entropy; The Decision Tree Algorithm; Handling Continuous Variables in Decision Trees; Random Forests; User Retention: Interpretability Versus Predictive Power; David Huffaker: Google’s Hybrid Approach to Social Research; Moving from Descriptive to Predictive; Social at Google; Privacy; Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
8. Recommendation Engines: Building a User-Facing Data Product at Scale
A Real-World Recommendation Engine; Nearest Neighbor Algorithm Review; Some Problems with Nearest Neighbors; Beyond Nearest Neighbor: Machine Learning Classification; The Dimensionality Problem; Singular Value Decomposition (SVD); Important Properties of SVD; Principal Component Analysis (PCA); Alternating Least Squares; Fix V and Update U; Last Thoughts on These Algorithms; Thought Experiment: Filter Bubbles; Exercise: Build Your Own Recommendation System; Sample Code in Python
9. Data Visualization and Fraud Detection
Data Visualization History; Gabriel Tarde; Mark’s Thought Experiment; What Is Data Science, Redux?; Processing; Franco Moretti; A Sample of Data Visualization Projects; Mark’s Data Visualization Projects; New York Times Lobby: Moveable Type; Project Cascade: Lives on a Screen; Cronkite Plaza; eBay Transactions and Books; Public Theater Shakespeare Machine; Goals of These Exhibits; Data Science and Risk; About Square; The Risk Challenge; The Trouble with Performance Estimation; Model Building Tips; Data Visualization at Square; Ian’s Thought Experiment; Data Visualization for the Rest of Us; Data Visualization Exercise

10. Social Networks and Data Journalism
Social Network Analysis at Morning Analytics; Case-Attribute Data versus Social Network Data; Social Network Analysis; Terminology from Social Networks; Centrality Measures; The Industry of Centrality Measures; Thought Experiment; Morningside Analytics; How Visualizations Help Us Find Schools of Fish; More Background on Social Network Analysis from a Statistical Point of View; Representations of Networks and Eigenvalue Centrality; A First Example of Random Graphs: The Erdos-Renyi Model; A Second Example of Random Graphs: The Exponential Random Graph Model; Data Journalism; A Bit of History on Data Journalism; Writing Technical Journalism: Advice from an Expert
11. Causality
Correlation Doesn’t Imply Causation; Asking Causal Questions; Confounders: A Dating Example; OK Cupid’s Attempt; The Gold Standard: Randomized Clinical Trials; A/B Tests; Second Best: Observational Studies; Simpson’s Paradox; The Rubin Causal Model; Visualizing Causality; Definition: The Causal Effect; Three Pieces of Advice
12. Epidemiology
Madigan’s Background; Thought Experiment; Modern Academic Statistics; Medical Literature and Observational Studies; Stratification Does Not Solve the Confounder Problem; What Do People Do About Confounding Things in Practice?; Is There a Better Way?; Research Experiment (Observational Medical Outcomes Partnership); Closing Thought Experiment
13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
Claudia’s Data Scientist Profile; The Life of a Chief Data Scientist; On Being a Female Data Scientist; Data Mining Competitions; How to Be a Good Modeler; Data Leakage; Market Predictions; Amazon Case Study: Big Spenders; A Jewelry Sampling Problem; IBM Customer Targeting; Breast Cancer Detection; Pneumonia Prediction; How to Avoid Leakage; Evaluating Models; Accuracy: Meh; Probabilities Matter, Not 0s and 1s; Choosing an Algorithm; A Final Example; Parting Thoughts
14. Data Engineering: MapReduce, Pregel, and Hadoop
About David Crawshaw; Thought Experiment; MapReduce; Word Frequency Problem; Enter MapReduce; Other Examples of MapReduce; What Can’t MapReduce Do?; Pregel; About Josh Wills; Thought Experiment; On Being a Data Scientist; Data Abundance Versus Data Scarcity; Designing Models; Economic Interlude: Hadoop; A Brief Introduction to Hadoop; Cloudera; Back to Josh: Workflow; So How to Get Started with Hadoop?
15. The Students Speak
Process Thinking; Naive No Longer; Helping Hands; Your Mileage May Vary; Bridging Tunnels; Some of Our Work
16. Next-Generation Data Scientists, Hubris, and Ethics
What Just Happened?; What Is Data Science (Again)?; What Are Next-Gen Data Scientists?; Being Problem Solvers; Cultivating Soft Skills; Being Question Askers; Being an Ethical Data Scientist; Career Advice
Index

Overview

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

Topics include:

Statistical inference, exploratory data analysis, and the data science process
Algorithms
Spam filters, Naive Bayes, and data wrangling
Logistic regression
Financial modeling
Recommendation engines and causality
Data visualization
Social networks and data journalism
Data engineering, MapReduce, Pregel, and Hadoop

Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.

Publisher Resources

ISBN: 9781449363871
