Data Science for Business

by Foster Provost, Tom Fawcett

August 2013

Beginner to intermediate

414 pages

13h 2m

English

O'Reilly Media, Inc.

Praise
Dedication
Preface
Our Conceptual Approach to Data Science; To the Instructor; Other Skills and Concepts; Sections and Notation; Using Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities; Example: Hurricane Frances; Example: Predicting Customer Churn; Data Science, Engineering, and Data-Driven Decision Making; Data Processing and “Big Data”; From Big Data 1.0 to Big Data 2.0; Data and Data Science Capability as a Strategic Asset; Data-Analytic Thinking; This Book; Data Mining and Data Science, Revisited; Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist; Summary
2. Business Problems and Data Science Solutions
From Business Problems to Data Mining Tasks; Supervised Versus Unsupervised Methods; Data Mining and Its Results; The Data Mining Process; Business Understanding; Data Understanding; Data Preparation; Modeling; Evaluation; Deployment; Implications for Managing the Data Science Team; Other Analytics Techniques and Technologies; Statistics; Database Querying; Data Warehousing; Regression Analysis; Machine Learning and Data Mining; Answering Business Questions with These Techniques; Summary
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Models, Induction, and Prediction; Supervised Segmentation; Selecting Informative Attributes; Example: Attribute Selection with Information Gain; Supervised Segmentation with Tree-Structured Models; Visualizing Segmentations; Trees as Sets of Rules; Probability Estimation; Example: Addressing the Churn Problem with Tree Induction; Summary
4. Fitting a Model to Data
Classification via Mathematical Functions; Linear Discriminant Functions; Optimizing an Objective Function; An Example of Mining a Linear Discriminant from Data; Linear Discriminant Functions for Scoring and Ranking Instances; Support Vector Machines, Briefly; Regression via Mathematical Functions; Class Probability Estimation and Logistic “Regression”; * Logistic Regression: Some Technical Details; Example: Logistic Regression versus Tree Induction; Nonlinear Functions, Support Vector Machines, and Neural Networks; Summary
5. Overfitting and Its Avoidance
Generalization; Overfitting; Overfitting Examined; Holdout Data and Fitting Graphs; Overfitting in Tree Induction; Overfitting in Mathematical Functions; Example: Overfitting Linear Functions; * Example: Why Is Overfitting Bad?; From Holdout Evaluation to Cross-Validation; The Churn Dataset Revisited; Learning Curves; Overfitting Avoidance and Complexity Control; Avoiding Overfitting with Tree Induction; A General Method for Avoiding Overfitting; * Avoiding Overfitting for Parameter Optimization; Summary
6. Similarity, Neighbors, and Clusters
Similarity and Distance; Nearest-Neighbor Reasoning; Example: Whiskey Analytics; Nearest Neighbors for Predictive Modeling; Classification; Probability Estimation; Regression; How Many Neighbors and How Much Influence?; Geometric Interpretation, Overfitting, and Complexity Control; Issues with Nearest-Neighbor Methods; Intelligibility; Dimensionality and domain knowledge; Computational efficiency; Some Important Technical Details Relating to Similarities and Neighbors; Heterogeneous Attributes; * Other Distance Functions; * Combining Functions: Calculating Scores from Neighbors; Clustering; Example: Whiskey Analytics Revisited; Hierarchical Clustering; Nearest Neighbors Revisited: Clustering Around Centroids; Example: Clustering Business News Stories; Data preparation; The news story clusters; Understanding the Results of Clustering; * Using Supervised Learning to Generate Cluster Descriptions; Stepping Back: Solving a Business Problem Versus Data Exploration; Summary
7. Decision Analytic Thinking I: What Is a Good Model?
Evaluating Classifiers; Plain Accuracy and Its Problems; The Confusion Matrix; Problems with Unbalanced Classes; Problems with Unequal Costs and Benefits; Generalizing Beyond Classification; A Key Analytical Framework: Expected Value; Using Expected Value to Frame Classifier Use; Using Expected Value to Frame Classifier Evaluation; Error rates; Costs and benefits; Evaluation, Baseline Performance, and Implications for Investments in Data; Summary

8. Visualizing Model Performance
Ranking Instead of Classifying; Profit Curves; ROC Graphs and Curves; The Area Under the ROC Curve (AUC); Cumulative Response and Lift Curves; Example: Performance Analytics for Churn Modeling; Summary
9. Evidence and Probabilities
Example: Targeting Online Consumers With Advertisements; Combining Evidence Probabilistically; Joint Probability and Independence; Bayes’ Rule; Applying Bayes’ Rule to Data Science; Conditional Independence and Naive Bayes; Advantages and Disadvantages of Naive Bayes; A Model of Evidence “Lift”; Example: Evidence Lifts from Facebook “Likes”; Evidence in Action: Targeting Consumers with Ads; Summary
10. Representing and Mining Text
Why Text Is Important; Why Text Is Difficult; Representation; Bag of Words; Term Frequency; Measuring Sparseness: Inverse Document Frequency; Combining Them: TFIDF; Example: Jazz Musicians; * The Relationship of IDF to Entropy; Beyond Bag of Words; N-gram Sequences; Named Entity Extraction; Topic Models; Example: Mining News Stories to Predict Stock Price Movement; The Task; The Data; Data Preprocessing; Results; Summary
11. Decision Analytic Thinking II: Toward Analytical Engineering
Targeting the Best Prospects for a Charity Mailing; The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces; A Brief Digression on Selection Bias; Our Churn Example Revisited with Even More Sophistication; The Expected Value Framework: Structuring a More Complicated Business Problem; Assessing the Influence of the Incentive; From an Expected Value Decomposition to a Data Science Solution; Summary
12. Other Data Science Tasks and Techniques
Co-occurrences and Associations: Finding Items That Go Together; Measuring Surprise: Lift and Leverage; Example: Beer and Lottery Tickets; Associations Among Facebook Likes; Profiling: Finding Typical Behavior; Link Prediction and Social Recommendation; Data Reduction, Latent Information, and Movie Recommendation; Bias, Variance, and Ensemble Methods; Data-Driven Causal Explanation and a Viral Marketing Example; Summary
13. Data Science and Business Strategy
Thinking Data-Analytically, Redux; Achieving Competitive Advantage with Data Science; Sustaining Competitive Advantage with Data Science; Formidable Historical Advantage; Unique Intellectual Property; Unique Intangible Collateral Assets; Superior Data Scientists; Superior Data Science Management; Attracting and Nurturing Data Scientists and Their Teams; Examine Data Science Case Studies; Be Ready to Accept Creative Ideas from Any Source; Be Ready to Evaluate Proposals for Data Science Projects; Example Data Mining Proposal; Flaws in the Big Red Proposal; A Firm’s Data Science Maturity
14. Conclusion
The Fundamental Concepts of Data Science; Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data; Changing the Way We Think about Solutions to Business Problems; What Data Can’t Do: Humans in the Loop, Revisited; Privacy, Ethics, and Mining Data About Individuals; Is There More to Data Science?; Final Example: From Crowd-Sourcing to Cloud-Sourcing; Final Words
A. Proposal Review Guide
Business and Data Understanding; Data Preparation; Modeling; Evaluation and Deployment
B. Another Sample Proposal
Scenario and Proposal; Flaws in the GGC Proposal
Glossary
C. Bibliography
Index
Colophon
Copyright

Content preview from Data Science for Business

Chapter 1. Introduction: Data-Analytic Thinking

Dream no small dreams for they have no power to move the hearts of men.

—Johann Wolfgang von Goethe

The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors’ movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data—the realm of data science.

The Ubiquity of Data Opportunities

With vast amounts of data now available, companies in almost every industry are focused on exploiting data for competitive advantage. In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far outstripped the capacity of manual analysis. At the same time, computers have become far more powerful, networking has become ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses than previously possible. The convergence of these phenomena has given rise to the increasingly ...

Publisher Resources

ISBN: 9781449374273
