Data Mining and Predictive Analytics, 2nd Edition

by Chantal D. Larose, Daniel T. Larose

March 2015

Beginner to intermediate

824 pages

22h 57m

English

Wiley

Cover
Series
Title Page
Copyright
Dedication
Preface
  What is Data Mining? What is Predictive Analytics?
  Why is this Book Needed?
  Who Will Benefit from this Book?
  Danger! Data Mining is Easy to do Badly
  “White-Box” Approach
  Algorithm Walk-Throughs
  Exciting New Topics
  The R Zone
  Appendix: Data Summarization and Visualization
  The Case Study: Bringing it all Together
  How the Book is Structured
  The Software
  Weka: The Open-Source Alternative
  The Companion Web Site: www.dataminingconsultant.com
  Data Mining and Predictive Analytics as a Textbook
Acknowledgments
  Daniel's Acknowledgments
  Chantal's Acknowledgments
Part I: Data Preparation
Chapter 1: An Introduction to Data Mining and Predictive Analytics
  1.1 What is Data Mining? What Is Predictive Analytics?
  1.2 Wanted: Data Miners
  1.3 The Need For Human Direction of Data Mining
  1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM
  1.5 Fallacies of Data Mining
  1.6 What Tasks can Data Mining Accomplish
  The R Zone
  R References
  Exercises
Chapter 2: Data Preprocessing
  2.1 Why do We Need to Preprocess the Data?
  2.2 Data Cleaning
  2.3 Handling Missing Data
  2.4 Identifying Misclassifications
  2.5 Graphical Methods for Identifying Outliers
  2.6 Measures of Center and Spread
  2.7 Data Transformation
  2.8 Min–Max Normalization
  2.9 Z-Score Standardization
  2.10 Decimal Scaling
  2.11 Transformations to Achieve Normality
  2.12 Numerical Methods for Identifying Outliers
  2.13 Flag Variables
  2.14 Transforming Categorical Variables into Numerical Variables
  2.15 Binning Numerical Variables
  2.16 Reclassifying Categorical Variables
  2.17 Adding an Index Field
  2.18 Removing Variables that are not Useful
  2.19 Variables that Should Probably not be Removed
  2.20 Removal of Duplicate Records
  2.21 A Word About ID Fields
  The R Zone
  R Reference
  Exercises
Chapter 3: Exploratory Data Analysis
  3.1 Hypothesis Testing Versus Exploratory Data Analysis
  3.2 Getting to Know The Data Set
  3.3 Exploring Categorical Variables
  3.4 Exploring Numeric Variables
  3.5 Exploring Multivariate Relationships
  3.6 Selecting Interesting Subsets of the Data for Further Investigation
  3.7 Using EDA to Uncover Anomalous Fields
  3.8 Binning Based on Predictive Value
  3.9 Deriving New Variables: Flag Variables
  3.10 Deriving New Variables: Numerical Variables
  3.11 Using EDA to Investigate Correlated Predictor Variables
  3.12 Summary of Our EDA
  The R Zone
  R References
  Exercises
Chapter 4: Dimension-Reduction Methods
  4.1 Need for Dimension-Reduction in Data Mining
  4.2 Principal Components Analysis
  4.3 Applying PCA to the Houses Data Set
  4.4 How Many Components Should We Extract?
  4.5 Profiling the Principal Components
  4.6 Communalities
  4.7 Validation of the Principal Components
  4.8 Factor Analysis
  4.9 Applying Factor Analysis to the Adult Data Set
  4.10 Factor Rotation
  4.11 User-Defined Composites
  4.12 An Example of a User-Defined Composite
  The R Zone
  R References
  Exercises
Part II: Statistical Analysis
Chapter 5: Univariate Statistical Analysis
  5.1 Data Mining Tasks in Discovering Knowledge in Data
  5.2 Statistical Approaches to Estimation and Prediction
  5.3 Statistical Inference
  5.4 How Confident are We in Our Estimates?
  5.5 Confidence Interval Estimation of the Mean
  5.6 How to Reduce the Margin of Error
  5.7 Confidence Interval Estimation of the Proportion
  5.8 Hypothesis Testing for the Mean
  5.9 Assessing The Strength of Evidence Against The Null Hypothesis
  5.10 Using Confidence Intervals to Perform Hypothesis Tests
  5.11 Hypothesis Testing for The Proportion
  Reference
  The R Zone
  R Reference
  Exercises
Chapter 6: Multivariate Statistics
  6.1 Two-Sample t-Test for Difference in Means
  6.2 Two-Sample Z-Test for Difference in Proportions
  6.3 Test for the Homogeneity of Proportions
  6.4 Chi-Square Test for Goodness of Fit of Multinomial Data
  6.5 Analysis of Variance
  Reference
  The R Zone
  R Reference
  Exercises
Chapter 7: Preparing to Model the Data
  7.1 Supervised Versus Unsupervised Methods
  7.2 Statistical Methodology and Data Mining Methodology
  7.3 Cross-Validation
  7.4 Overfitting
  7.5 Bias–Variance Trade-Off
  7.6 Balancing The Training Data Set
  7.7 Establishing Baseline Performance
  The R Zone
  R Reference
  Exercises
Chapter 8: Simple Linear Regression
  8.1 An Example of Simple Linear Regression
  8.2 Dangers of Extrapolation
  8.3 How Useful is the Regression? The Coefficient of Determination, r²
  8.4 Standard Error of the Estimate, s
  8.5 Correlation Coefficient
  8.6 Anova Table for Simple Linear Regression
  8.7 Outliers, High Leverage Points, and Influential Observations
  8.8 Population Regression Equation
  8.9 Verifying The Regression Assumptions
  8.10 Inference in Regression
  8.11 t-Test for the Relationship Between x and y
  8.12 Confidence Interval for the Slope of the Regression Line
  8.13 Confidence Interval for the Correlation Coefficient ρ
  8.14 Confidence Interval for the Mean Value of y Given x
  8.15 Prediction Interval for a Randomly Chosen Value of y Given x
  8.16 Transformations to Achieve Linearity
  8.17 Box–Cox Transformations
  The R Zone
  R References
  Exercises
Chapter 9: Multiple Regression and Model Building
  9.1 An Example of Multiple Regression
  9.2 The Population Multiple Regression Equation
  9.3 Inference in Multiple Regression
  9.4 Regression With Categorical Predictors, Using Indicator Variables
  9.5 Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
  9.6 Sequential Sums of Squares
  9.7 Multicollinearity
  9.8 Variable Selection Methods
  9.9 Gas Mileage Data Set
  9.10 An Application of Variable Selection Methods
  9.11 Using the Principal Components as Predictors in Multiple Regression
  The R Zone
  R References
  Exercises
Part III: Classification
Chapter 10: k-Nearest Neighbor Algorithm
  10.1 Classification Task
  10.2 k-Nearest Neighbor Algorithm
  10.3 Distance Function
  10.4 Combination Function
  10.5 Quantifying Attribute Relevance: Stretching the Axes
  10.6 Database Considerations
  10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
  10.8 Choosing k
  10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
  The R Zone
  R References
  Exercises
Chapter 11: Decision Trees
  11.1 What is a Decision Tree?
  11.2 Requirements for Using Decision Trees
  11.3 Classification and Regression Trees
  11.4 C4.5 Algorithm
  11.5 Decision Rules
  11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
  The R Zone
  R References
  Exercises
Chapter 12: Neural Networks
  12.1 Input and Output Encoding
  12.2 Neural Networks for Estimation and Prediction
  12.3 Simple Example of a Neural Network
  12.4 Sigmoid Activation Function
  12.5 Back-Propagation
  12.6 Gradient-Descent Method
  12.7 Back-Propagation Rules
  12.8 Example of Back-Propagation
  12.9 Termination Criteria
  12.10 Learning Rate
  12.11 Momentum Term
  12.12 Sensitivity Analysis
  12.13 Application of Neural Network Modeling
  The R Zone
  R References
  Exercises
Chapter 13: Logistic Regression
  13.1 Simple Example of Logistic Regression
  13.2 Maximum Likelihood Estimation
  13.3 Interpreting Logistic Regression Output
  13.4 Inference: Are the Predictors Significant?
  13.5 Odds Ratio and Relative Risk
  13.6 Interpreting Logistic Regression for a Dichotomous Predictor
  13.7 Interpreting Logistic Regression for a Polychotomous Predictor
  13.8 Interpreting Logistic Regression for a Continuous Predictor
  13.9 Assumption of Linearity
  13.10 Zero-Cell Problem
  13.11 Multiple Logistic Regression
  13.12 Introducing Higher Order Terms to Handle Nonlinearity
  13.13 Validating the Logistic Regression Model
  13.14 WEKA: Hands-On Analysis Using Logistic Regression
  The R Zone
  R References
  Exercises
Chapter 14: Naïve Bayes and Bayesian Networks
  14.1 Bayesian Approach
  14.2 Maximum A Posteriori (MAP) Classification
  14.3 Posterior Odds Ratio
  14.4 Balancing The Data
  14.5 Naïve Bayes Classification
  14.6 Interpreting The Log Posterior Odds Ratio
  14.7 Zero-Cell Problem
  14.8 Numeric Predictors for Naïve Bayes Classification
  14.9 WEKA: Hands-on Analysis Using Naïve Bayes
  14.10 Bayesian Belief Networks
  14.11 Clothing Purchase Example
  14.12 Using The Bayesian Network to Find Probabilities
  The R Zone
  R References
  Exercises
Chapter 15: Model Evaluation Techniques
  15.1 Model Evaluation Techniques for the Description Task
  15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
  15.3 Model Evaluation Measures for the Classification Task
  15.4 Accuracy and Overall Error Rate
  15.5 Sensitivity and Specificity
  15.6 False-Positive Rate and False-Negative Rate
  15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
  15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
  15.9 Decision Cost/Benefit Analysis
  15.10 Lift Charts and Gains Charts
  15.11 Interweaving Model Evaluation with Model Building
  15.12 Confluence of Results: Applying a Suite of Models
  The R Zone
  R References
  Exercises
  Hands-On Analysis
Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs
  16.1 Decision Invariance Under Row Adjustment
  16.2 Positive Classification Criterion
  16.3 Demonstration Of The Positive Classification Criterion
  16.4 Constructing The Cost Matrix
  16.5 Decision Invariance Under Scaling
  16.6 Direct Costs and Opportunity Costs
  16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
  16.8 Rebalancing as a Surrogate for Misclassification Costs
  The R Zone
  R References
  Exercises
Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models
  17.1 Classification Evaluation Measures for a Generic Trinary Target
  17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
  17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
  17.4 Comparing Cart Models With and Without Data-Driven Misclassification Costs
  17.5 Classification Evaluation Measures for a Generic k-Nary Target
  17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
  The R Zone
  R References
  Exercises
Chapter 18: Graphical Evaluation of Classification Models
  18.1 Review of Lift Charts and Gains Charts
  18.2 Lift Charts and Gains Charts Using Misclassification Costs
  18.3 Response Charts
  18.4 Profits Charts
  18.5 Return on Investment (ROI) Charts
  The R Zone
  R References
  Exercises
  Hands-On Exercises
Part IV: Clustering
Chapter 19: Hierarchical and k-Means Clustering
  19.1 The Clustering Task
  19.2 Hierarchical Clustering Methods
  19.3 Single-Linkage Clustering
  19.4 Complete-Linkage Clustering
  19.5 k-Means Clustering
  19.6 Example of k-Means Clustering at Work
  19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
  19.8 Application of k-Means Clustering Using SAS Enterprise Miner
  19.9 Using Cluster Membership to Predict Churn
  The R Zone
  R References
  Exercises
  Hands-On Analysis
Chapter 20: Kohonen Networks
  20.1 Self-Organizing Maps
  20.2 Kohonen Networks
  20.3 Example of a Kohonen Network Study
  20.4 Cluster Validity
  20.5 Application of Clustering Using Kohonen Networks
  20.6 Interpreting The Clusters
  20.7 Using Cluster Membership as Input to Downstream Data Mining Models
  The R Zone
  R References
  Exercises
Chapter 21: BIRCH Clustering
  21.1 Rationale for BIRCH Clustering
  21.2 Cluster Features
  21.3 Cluster Feature Tree
  21.4 Phase 1: Building The CF Tree
  21.5 Phase 2: Clustering The Sub-Clusters
  21.6 Example of BIRCH Clustering, Phase 1: Building The CF Tree
  21.7 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
  21.8 Evaluating The Candidate Cluster Solutions
  21.9 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
  The R Zone
  R References
  Exercises
Chapter 22: Measuring Cluster Goodness
  22.1 Rationale for Measuring Cluster Goodness
  22.2 The Silhouette Method
  22.3 Silhouette Example
  22.4 Silhouette Analysis of the IRIS Data Set
  22.5 The Pseudo-F Statistic
  22.6 Example of the Pseudo-F Statistic
  22.7 Pseudo-F Statistic Applied to the IRIS Data Set
  22.8 Cluster Validation
  22.9 Cluster Validation Applied to the Loans Data Set
  The R Zone
  R References
  Exercises
Part V: Association Rules
Chapter 23: Association Rules
  23.1 Affinity Analysis and Market Basket Analysis
  23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
  23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
  23.5 Extension From Flag Data to General Categorical Data
  23.6 Information-Theoretic Approach: Generalized Rule Induction Method
  23.7 Association Rules are Easy to do Badly
  23.8 How Can We Measure the Usefulness of Association Rules?
  23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
  23.10 Local Patterns Versus Global Models
  The R Zone
  R References
  Exercises
Part VI: Enhancing Model Performance
Chapter 24: Segmentation Models
  24.1 The Segmentation Modeling Process
  24.2 Segmentation Modeling Using EDA to Identify the Segments
  24.3 Segmentation Modeling using Clustering to Identify the Segments
  The R Zone
  R References
  Exercises
Chapter 25: Ensemble Methods: Bagging and Boosting
  25.1 Rationale for Using an Ensemble of Classification Models
  25.2 Bias, Variance, and Noise
  25.3 When to Apply, and not to apply, Bagging
  25.4 Bagging
  25.5 Boosting
  25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler
  References
  The R Zone
  R Reference
  Exercises
Chapter 26: Model Voting and Propensity Averaging
  26.1 Simple Model Voting
  26.2 Alternative Voting Methods
  26.3 Model Voting Process
  26.4 An Application of Model Voting
  26.5 What is Propensity Averaging?
  26.6 Propensity Averaging Process
  26.7 An Application of Propensity Averaging
  The R Zone
  R References
  Exercises
  Hands-On Analysis
Part VII: Further Topics
Chapter 27: Genetic Algorithms
  27.1 Introduction To Genetic Algorithms
  27.2 Basic Framework of a Genetic Algorithm
  27.3 Simple Example of a Genetic Algorithm at Work
  27.4 Modifications and Enhancements: Selection
  27.5 Modifications and Enhancements: Crossover
  27.6 Genetic Algorithms for Real-Valued Variables
  27.7 Using Genetic Algorithms to Train a Neural Network
  27.8 WEKA: Hands-On Analysis Using Genetic Algorithms
  The R Zone
  R References
Chapter 28: Imputation of Missing Data
  28.1 Need for Imputation of Missing Data
  28.2 Imputation of Missing Data: Continuous Variables
  28.3 Standard Error of the Imputation
  28.4 Imputation of Missing Data: Categorical Variables
  28.5 Handling Patterns in Missingness
  Reference
  The R Zone
  R References
Part VIII: Case Study: Predicting Response to Direct-Mail Marketing
Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA
  29.1 Cross-Industry Standard Practice for Data Mining
  29.2 Business Understanding Phase
  29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set
  29.4 Data Preparation Phase
  29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis
Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
  30.1 Partitioning the Data
  30.2 Developing the Principal Components
  30.3 Validating the Principal Components
  30.4 Profiling the Principal Components
  30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
  30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
  30.7 Application of k-Means Clustering
  30.8 Validating the Clusters
  30.9 Profiling the Clusters
Chapter 31: Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability
  31.1 Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
  31.2 Modeling And Evaluation Overview
  31.3 Cost-Benefit Analysis Using Data-Driven Costs
  31.4 Variables to be Input To The Models
  31.5 Establishing The Baseline Model Performance
  31.6 Models That Use Misclassification Costs
  31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs
  31.8 Combining Models Using Voting and Propensity Averaging
  31.9 Interpreting The Most Profitable Model
Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
  32.1 Variables to be Input to the Models
  32.2 Models that use Misclassification Costs
  32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs
  32.4 Combining Models using Voting and Propensity Averaging
  32.5 Lessons Learned
  32.6 Conclusions
Appendix A: Data Summarization and Visualization
  Part 1: Summarization 1: Building Blocks Of Data Analysis
  Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
  Part 3: Summarization 2: Measures Of Center, Variability, and Position
  Part 4: Summarization And Visualization Of Bivariate Relationships
Index
End User License Agreement

Content preview from Data Mining and Predictive Analytics, 2nd Edition

Chapter 18: Graphical Evaluation of Classification Models

18.1 Review of Lift Charts and Gains Charts

In Chapter 15, we learned about lift charts and gains charts. Recall that lift is defined as the proportion of positive hits in the set of the model's positive classifications, divided by the proportion of positive hits in the data set overall:

$$\text{Lift} = \frac{\text{Proportion of positive hits in the set of positive classifications}}{\text{Proportion of positive hits in the data set overall}}$$

where a hit is defined as a positive response that was predicted to be positive. To construct a lift chart, the software sorts the records by propensity to respond positively, and then calculates the lift at each percentile. For example, a lift value of 2.0 at the 20th percentile means that the 20% of records that contain the most likely responders have twice as many responders as a similarly sized random sample of records. Gains charts represent the cumulative form of lift charts. For more on lift charts and gains charts, see Chapter 15.
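
The ranking procedure described above can be illustrated with a minimal Python sketch. It is not taken from the book: the function name `lift_and_gains`, the toy propensities, and the toy responses are all hypothetical. The sketch treats the top p% of ranked records as the model's positive classifications, and it assumes a common convention for gains, namely the share of all responders captured in that top p%.

```python
def lift_and_gains(propensities, actuals, percentiles=(20, 50, 100)):
    """Return {percentile: (lift, gains)} for records ranked by propensity.

    propensities: model scores, higher means more likely to respond (assumed).
    actuals: 1 for a positive response, 0 otherwise.
    """
    if len(propensities) != len(actuals):
        raise ValueError("propensities and actuals must have the same length")
    n = len(actuals)
    total_positives = sum(actuals)
    if n == 0 or total_positives == 0:
        raise ValueError("need at least one record and one positive response")

    # Proportion of positive responses in the data set overall.
    overall_rate = total_positives / n

    # Sort the records from most likely to least likely responder.
    ranked = [
        actual
        for _, actual in sorted(
            zip(propensities, actuals), key=lambda pair: pair[0], reverse=True
        )
    ]

    results = {}
    for p in percentiles:
        k = max(1, round(n * p / 100))   # number of records in the top p%
        hits = sum(ranked[:k])           # responders among those records
        lift = (hits / k) / overall_rate
        gains = hits / total_positives   # assumed convention: share captured
        results[p] = (lift, gains)
    return results


# Hypothetical toy data: ten records, four of which responded.
propensities = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

results = lift_and_gains(propensities, actuals)
for p, (lift, gains) in results.items():
    print(f"top {p}%: lift = {lift:.2f}, gains = {gains:.2f}")
```

At the 100th percentile the lift is necessarily 1, because the top 100% of records is the whole data set.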

18.2 Lift Charts and Gains Charts Using Misclassification Costs

Lift charts and gains charts may be used in the presence of misclassification costs. This works because the software ranks the records by propensity to respond, and the misclassification costs directly affect the propensity to respond for a given classification model. Recall the Loans data set, where a bank would like to predict loan approval for a training data set of about 150,000 loan applicants, based on the predictors debt-to-income ...

Publisher Resources

ISBN: 9781118868706
