Data Science for Business

by Foster Provost, Tom Fawcett

August 2013

Beginner to intermediate

414 pages

13h 2m

English

O'Reilly Media, Inc.

Audiobook available

Praise
Dedication
Preface
Our Conceptual Approach to Data Science; To the Instructor; Other Skills and Concepts; Sections and Notation; Using Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities; Example: Hurricane Frances; Example: Predicting Customer Churn; Data Science, Engineering, and Data-Driven Decision Making; Data Processing and “Big Data”; From Big Data 1.0 to Big Data 2.0; Data and Data Science Capability as a Strategic Asset; Data-Analytic Thinking; This Book; Data Mining and Data Science, Revisited; Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist; Summary
2. Business Problems and Data Science Solutions
From Business Problems to Data Mining Tasks; Supervised Versus Unsupervised Methods; Data Mining and Its Results; The Data Mining Process; Business Understanding; Data Understanding; Data Preparation; Modeling; Evaluation; Deployment; Implications for Managing the Data Science Team; Other Analytics Techniques and Technologies; Statistics; Database Querying; Data Warehousing; Regression Analysis; Machine Learning and Data Mining; Answering Business Questions with These Techniques; Summary
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Models, Induction, and Prediction; Supervised Segmentation; Selecting Informative Attributes; Example: Attribute Selection with Information Gain; Supervised Segmentation with Tree-Structured Models; Visualizing Segmentations; Trees as Sets of Rules; Probability Estimation; Example: Addressing the Churn Problem with Tree Induction; Summary
4. Fitting a Model to Data
Classification via Mathematical Functions; Linear Discriminant Functions; Optimizing an Objective Function; An Example of Mining a Linear Discriminant from Data; Linear Discriminant Functions for Scoring and Ranking Instances; Support Vector Machines, Briefly; Regression via Mathematical Functions; Class Probability Estimation and Logistic “Regression”; * Logistic Regression: Some Technical Details; Example: Logistic Regression versus Tree Induction; Nonlinear Functions, Support Vector Machines, and Neural Networks; Summary
5. Overfitting and Its Avoidance
Generalization; Overfitting; Overfitting Examined; Holdout Data and Fitting Graphs; Overfitting in Tree Induction; Overfitting in Mathematical Functions; Example: Overfitting Linear Functions; * Example: Why Is Overfitting Bad?; From Holdout Evaluation to Cross-Validation; The Churn Dataset Revisited; Learning Curves; Overfitting Avoidance and Complexity Control; Avoiding Overfitting with Tree Induction; A General Method for Avoiding Overfitting; * Avoiding Overfitting for Parameter Optimization; Summary
6. Similarity, Neighbors, and Clusters
Similarity and Distance; Nearest-Neighbor Reasoning; Example: Whiskey Analytics; Nearest Neighbors for Predictive Modeling; Classification; Probability Estimation; Regression; How Many Neighbors and How Much Influence?; Geometric Interpretation, Overfitting, and Complexity Control; Issues with Nearest-Neighbor Methods; Intelligibility; Dimensionality and domain knowledge; Computational efficiency; Some Important Technical Details Relating to Similarities and Neighbors; Heterogeneous Attributes; * Other Distance Functions; * Combining Functions: Calculating Scores from Neighbors; Clustering; Example: Whiskey Analytics Revisited; Hierarchical Clustering; Nearest Neighbors Revisited: Clustering Around Centroids; Example: Clustering Business News Stories; Data preparation; The news story clusters; Understanding the Results of Clustering; * Using Supervised Learning to Generate Cluster Descriptions; Stepping Back: Solving a Business Problem Versus Data Exploration; Summary
7. Decision Analytic Thinking I: What Is a Good Model?
Evaluating Classifiers; Plain Accuracy and Its Problems; The Confusion Matrix; Problems with Unbalanced Classes; Problems with Unequal Costs and Benefits; Generalizing Beyond Classification; A Key Analytical Framework: Expected Value; Using Expected Value to Frame Classifier Use; Using Expected Value to Frame Classifier Evaluation; Error rates; Costs and benefits; Evaluation, Baseline Performance, and Implications for Investments in Data; Summary

8. Visualizing Model Performance
Ranking Instead of Classifying; Profit Curves; ROC Graphs and Curves; The Area Under the ROC Curve (AUC); Cumulative Response and Lift Curves; Example: Performance Analytics for Churn Modeling; Summary
9. Evidence and Probabilities
Example: Targeting Online Consumers With Advertisements; Combining Evidence Probabilistically; Joint Probability and Independence; Bayes’ Rule; Applying Bayes’ Rule to Data Science; Conditional Independence and Naive Bayes; Advantages and Disadvantages of Naive Bayes; A Model of Evidence “Lift”; Example: Evidence Lifts from Facebook “Likes”; Evidence in Action: Targeting Consumers with Ads; Summary
10. Representing and Mining Text
Why Text Is Important; Why Text Is Difficult; Representation; Bag of Words; Term Frequency; Measuring Sparseness: Inverse Document Frequency; Combining Them: TFIDF; Example: Jazz Musicians; * The Relationship of IDF to Entropy; Beyond Bag of Words; N-gram Sequences; Named Entity Extraction; Topic Models; Example: Mining News Stories to Predict Stock Price Movement; The Task; The Data; Data Preprocessing; Results; Summary
11. Decision Analytic Thinking II: Toward Analytical Engineering
Targeting the Best Prospects for a Charity Mailing; The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces; A Brief Digression on Selection Bias; Our Churn Example Revisited with Even More Sophistication; The Expected Value Framework: Structuring a More Complicated Business Problem; Assessing the Influence of the Incentive; From an Expected Value Decomposition to a Data Science Solution; Summary
12. Other Data Science Tasks and Techniques
Co-occurrences and Associations: Finding Items That Go Together; Measuring Surprise: Lift and Leverage; Example: Beer and Lottery Tickets; Associations Among Facebook Likes; Profiling: Finding Typical Behavior; Link Prediction and Social Recommendation; Data Reduction, Latent Information, and Movie Recommendation; Bias, Variance, and Ensemble Methods; Data-Driven Causal Explanation and a Viral Marketing Example; Summary
13. Data Science and Business Strategy
Thinking Data-Analytically, Redux; Achieving Competitive Advantage with Data Science; Sustaining Competitive Advantage with Data Science; Formidable Historical Advantage; Unique Intellectual Property; Unique Intangible Collateral Assets; Superior Data Scientists; Superior Data Science Management; Attracting and Nurturing Data Scientists and Their Teams; Examine Data Science Case Studies; Be Ready to Accept Creative Ideas from Any Source; Be Ready to Evaluate Proposals for Data Science Projects; Example Data Mining Proposal; Flaws in the Big Red Proposal; A Firm’s Data Science Maturity
14. Conclusion
The Fundamental Concepts of Data Science; Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data; Changing the Way We Think about Solutions to Business Problems; What Data Can’t Do: Humans in the Loop, Revisited; Privacy, Ethics, and Mining Data About Individuals; Is There More to Data Science?; Final Example: From Crowd-Sourcing to Cloud-Sourcing; Final Words
A. Proposal Review Guide
Business and Data Understanding; Data Preparation; Modeling; Evaluation and Deployment
B. Another Sample Proposal
Scenario and Proposal; Flaws in the GGC Proposal
Glossary
C. Bibliography
Index
Colophon
Copyright

Content preview from Data Science for Business

Chapter 6. Similarity, Neighbors, and Clusters

Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.

Similarity underlies many data science methods and solutions to business problems. If two things (people, companies, products) are similar in some ways, they often share other characteristics as well. Data mining procedures are often based on grouping things by similarity or searching for the “right” sort of similarity. We saw this implicitly in previous chapters, where modeling procedures create boundaries for grouping together instances that have similar values for their target variables. In this chapter we will look at similarity directly and show how it applies to a variety of tasks. We include sections with some technical details so that the more mathematical reader can understand similarity in more depth; these sections can be skipped.
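
As a rough illustration of looking at similarity directly, the sketch below ranks a few entities by their distance to a target entity. It assumes Euclidean distance over numeric feature vectors, which is only one possible choice; the preview does not name a specific measure. The function names, company names, and feature values are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=2):
    """Return the names of the k candidates closest to the target.

    `candidates` maps a name to its feature vector. A smaller distance
    is treated as greater similarity.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: euclidean_distance(target, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical, already-normalized features for illustration only.
best_customer = (0.9, 0.8, 0.7)
prospects = {
    "Company A": (0.85, 0.75, 0.70),
    "Company B": (0.10, 0.20, 0.30),
    "Company C": (0.80, 0.90, 0.60),
}

print(most_similar(best_customer, prospects, k=2))
```

In practice the features would need to be on comparable scales before distances are meaningful, which is why the toy values here are assumed to be normalized.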

Different sorts of business tasks involve reasoning from similar examples:

We may want to retrieve similar things directly. For example, IBM wants to find companies that are similar to their best business customers, in order to have the sales staff look at them as prospects. Hewlett-Packard maintains many high-performance servers for clients; this maintenance is aided by a tool that, given ...

Publisher Resources

ISBN: 9781449374273
