Data Science for Business

by Foster Provost, Tom Fawcett

August 2013

Beginner to intermediate

414 pages

13h 2m

English

O'Reilly Media, Inc.

Audiobook available

Praise
Dedication
Preface
Our Conceptual Approach to Data Science
To the Instructor
Other Skills and Concepts
Sections and Notation
Using Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities
Example: Hurricane Frances
Example: Predicting Customer Churn
Data Science, Engineering, and Data-Driven Decision Making
Data Processing and “Big Data”
From Big Data 1.0 to Big Data 2.0
Data and Data Science Capability as a Strategic Asset
Data-Analytic Thinking
This Book
Data Mining and Data Science, Revisited
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
Summary
2. Business Problems and Data Science Solutions
From Business Problems to Data Mining Tasks
Supervised Versus Unsupervised Methods
Data Mining and Its Results
The Data Mining Process
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Implications for Managing the Data Science Team
Other Analytics Techniques and Technologies
Statistics
Database Querying
Data Warehousing
Regression Analysis
Machine Learning and Data Mining
Answering Business Questions with These Techniques
Summary
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Models, Induction, and Prediction
Supervised Segmentation
Selecting Informative Attributes
Example: Attribute Selection with Information Gain
Supervised Segmentation with Tree-Structured Models
Visualizing Segmentations
Trees as Sets of Rules
Probability Estimation
Example: Addressing the Churn Problem with Tree Induction
Summary
4. Fitting a Model to Data
Classification via Mathematical Functions
Linear Discriminant Functions
Optimizing an Objective Function
An Example of Mining a Linear Discriminant from Data
Linear Discriminant Functions for Scoring and Ranking Instances
Support Vector Machines, Briefly
Regression via Mathematical Functions
Class Probability Estimation and Logistic “Regression”
* Logistic Regression: Some Technical Details
Example: Logistic Regression versus Tree Induction
Nonlinear Functions, Support Vector Machines, and Neural Networks
Summary
5. Overfitting and Its Avoidance
Generalization
Overfitting
Overfitting Examined
Holdout Data and Fitting Graphs
Overfitting in Tree Induction
Overfitting in Mathematical Functions
Example: Overfitting Linear Functions
* Example: Why Is Overfitting Bad?
From Holdout Evaluation to Cross-Validation
The Churn Dataset Revisited
Learning Curves
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
A General Method for Avoiding Overfitting
* Avoiding Overfitting for Parameter Optimization
Summary
6. Similarity, Neighbors, and Clusters
Similarity and Distance
Nearest-Neighbor Reasoning
Example: Whiskey Analytics
Nearest Neighbors for Predictive Modeling
Classification
Probability Estimation
Regression
How Many Neighbors and How Much Influence?
Geometric Interpretation, Overfitting, and Complexity Control
Issues with Nearest-Neighbor Methods
Intelligibility
Dimensionality and domain knowledge
Computational efficiency
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
* Other Distance Functions
* Combining Functions: Calculating Scores from Neighbors
Clustering
Example: Whiskey Analytics Revisited
Hierarchical Clustering
Nearest Neighbors Revisited: Clustering Around Centroids
Example: Clustering Business News Stories
Data preparation
The news story clusters
Understanding the Results of Clustering
* Using Supervised Learning to Generate Cluster Descriptions
Stepping Back: Solving a Business Problem Versus Data Exploration
Summary
7. Decision Analytic Thinking I: What Is a Good Model?
Evaluating Classifiers
Plain Accuracy and Its Problems
The Confusion Matrix
Problems with Unbalanced Classes
Problems with Unequal Costs and Benefits
Generalizing Beyond Classification
A Key Analytical Framework: Expected Value
Using Expected Value to Frame Classifier Use
Using Expected Value to Frame Classifier Evaluation
Error rates
Costs and benefits
Evaluation, Baseline Performance, and Implications for Investments in Data
Summary

8. Visualizing Model Performance
Ranking Instead of Classifying
Profit Curves
ROC Graphs and Curves
The Area Under the ROC Curve (AUC)
Cumulative Response and Lift Curves
Example: Performance Analytics for Churn Modeling
Summary
9. Evidence and Probabilities
Example: Targeting Online Consumers With Advertisements
Combining Evidence Probabilistically
Joint Probability and Independence
Bayes’ Rule
Applying Bayes’ Rule to Data Science
Conditional Independence and Naive Bayes
Advantages and Disadvantages of Naive Bayes
A Model of Evidence “Lift”
Example: Evidence Lifts from Facebook “Likes”
Evidence in Action: Targeting Consumers with Ads
Summary
10. Representing and Mining Text
Why Text Is Important
Why Text Is Difficult
Representation
Bag of Words
Term Frequency
Measuring Sparseness: Inverse Document Frequency
Combining Them: TFIDF
Example: Jazz Musicians
* The Relationship of IDF to Entropy
Beyond Bag of Words
N-gram Sequences
Named Entity Extraction
Topic Models
Example: Mining News Stories to Predict Stock Price Movement
The Task
The Data
Data Preprocessing
Results
Summary
11. Decision Analytic Thinking II: Toward Analytical Engineering
Targeting the Best Prospects for a Charity Mailing
The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces
A Brief Digression on Selection Bias
Our Churn Example Revisited with Even More Sophistication
The Expected Value Framework: Structuring a More Complicated Business Problem
Assessing the Influence of the Incentive
From an Expected Value Decomposition to a Data Science Solution
Summary
12. Other Data Science Tasks and Techniques
Co-occurrences and Associations: Finding Items That Go Together
Measuring Surprise: Lift and Leverage
Example: Beer and Lottery Tickets
Associations Among Facebook Likes
Profiling: Finding Typical Behavior
Link Prediction and Social Recommendation
Data Reduction, Latent Information, and Movie Recommendation
Bias, Variance, and Ensemble Methods
Data-Driven Causal Explanation and a Viral Marketing Example
Summary
13. Data Science and Business Strategy
Thinking Data-Analytically, Redux
Achieving Competitive Advantage with Data Science
Sustaining Competitive Advantage with Data Science
Formidable Historical Advantage
Unique Intellectual Property
Unique Intangible Collateral Assets
Superior Data Scientists
Superior Data Science Management
Attracting and Nurturing Data Scientists and Their Teams
Examine Data Science Case Studies
Be Ready to Accept Creative Ideas from Any Source
Be Ready to Evaluate Proposals for Data Science Projects
Example Data Mining Proposal
Flaws in the Big Red Proposal
A Firm’s Data Science Maturity
14. Conclusion
The Fundamental Concepts of Data Science
Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
Changing the Way We Think about Solutions to Business Problems
What Data Can’t Do: Humans in the Loop, Revisited
Privacy, Ethics, and Mining Data About Individuals
Is There More to Data Science?
Final Example: From Crowd-Sourcing to Cloud-Sourcing
Final Words
A. Proposal Review Guide
Business and Data Understanding
Data Preparation
Modeling
Evaluation and Deployment
B. Another Sample Proposal
Scenario and Proposal
Flaws in the GGC Proposal
Glossary
C. Bibliography
Index
Colophon
Copyright

Content preview from Data Science for Business

Preface

Foster Provost

Tom Fawcett

Data Science for Business is intended for several sorts of readers:

Business people who will be working with data scientists, managing data science–oriented projects, or investing in data science ventures,
Developers who will be implementing data science solutions, and
Aspiring data scientists.

This is not a book about algorithms, nor is it a replacement for a book about algorithms. We deliberately avoided an algorithm-centered approach. We believe there is a relatively small set of fundamental concepts or principles that underlie techniques for extracting useful knowledge from data. These concepts serve as the foundation for many well-known algorithms of data mining. Moreover, these concepts underlie the analysis of data-centered business problems, the creation and evaluation of data science solutions, and the evaluation of general data science strategies and proposals. Accordingly, we organized the exposition around these general principles rather than around specific algorithms. Where necessary to describe procedural details, we use a combination of text and diagrams, which we think are more accessible than a listing of detailed algorithmic steps.

The book does not presume a sophisticated mathematical background. However, by its very nature the material is somewhat technical—the goal is to impart a significant understanding of data science, not just to give a high-level overview. In general, we have tried to minimize the mathematics and make ...

Publisher Resources

ISBN: 9781449374273
