Data Mining and Predictive Analytics, 2nd Edition

Book description

Learn methods of data analysis and their application to real-world data sets

This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified "white box" approach to data mining methods and models. This approach walks readers through the operations and nuances of the various methods, using small data sets, so that readers can gain insight into the inner workings of the method under review. Chapters provide hands-on analysis problems that give readers an opportunity to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets.

Data Mining and Predictive Analytics, Second Edition:

  • Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and the R statistical programming language

  • Features over 750 chapter exercises, allowing readers to assess their understanding of the new material

  • Provides a detailed case study that brings together the lessons learned in the book

  • Includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content

Data Mining and Predictive Analytics, Second Edition will appeal to computer science and statistics students, as well as students in MBA programs and chief executives.

    Table of contents

    1. Cover
    2. Series
    3. Title Page
    4. Copyright
    5. Dedication
    6. Preface
      1. What is Data Mining? What is Predictive Analytics?
      2. Why is this Book Needed?
      3. Who Will Benefit from this Book?
      4. Danger! Data Mining is Easy to do Badly
      5. “White-Box” Approach
      6. Algorithm Walk-Throughs
      7. Exciting New Topics
      8. The R Zone
      9. Appendix: Data Summarization and Visualization
      10. The Case Study: Bringing it all Together
      11. How the Book is Structured
      12. The Software
      13. Weka: The Open-Source Alternative
      14. The Companion Web Site: www.dataminingconsultant.com
      15. Data Mining and Predictive Analytics as a Textbook
    7. Acknowledgments
      1. Daniel's Acknowledgments
      2. Chantal's Acknowledgments
    8. Part I: Data Preparation
    9. Chapter 1: An Introduction to Data Mining and Predictive Analytics
      1. 1.1 What is Data Mining? What Is Predictive Analytics?
      2. 1.2 Wanted: Data Miners
      3. 1.3 The Need For Human Direction of Data Mining
      4. 1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM
      5. 1.5 Fallacies of Data Mining
      6. 1.6 What Tasks can Data Mining Accomplish?
      7. The R Zone
      8. R References
      9. Exercises
    10. Chapter 2: Data Preprocessing
      1. 2.1 Why do We Need to Preprocess the Data?
      2. 2.2 Data Cleaning
      3. 2.3 Handling Missing Data
      4. 2.4 Identifying Misclassifications
      5. 2.5 Graphical Methods for Identifying Outliers
      6. 2.6 Measures of Center and Spread
      7. 2.7 Data Transformation
      8. 2.8 Min–Max Normalization
      9. 2.9 Z-Score Standardization
      10. 2.10 Decimal Scaling
      11. 2.11 Transformations to Achieve Normality
      12. 2.12 Numerical Methods for Identifying Outliers
      13. 2.13 Flag Variables
      14. 2.14 Transforming Categorical Variables into Numerical Variables
      15. 2.15 Binning Numerical Variables
      16. 2.16 Reclassifying Categorical Variables
      17. 2.17 Adding an Index Field
      18. 2.18 Removing Variables that are not Useful
      19. 2.19 Variables that Should Probably not be Removed
      20. 2.20 Removal of Duplicate Records
      21. 2.21 A Word About ID Fields
      22. The R Zone
      23. R Reference
      24. Exercises
    11. Chapter 3: Exploratory Data Analysis
      1. 3.1 Hypothesis Testing Versus Exploratory Data Analysis
      2. 3.2 Getting to Know The Data Set
      3. 3.3 Exploring Categorical Variables
      4. 3.4 Exploring Numeric Variables
      5. 3.5 Exploring Multivariate Relationships
      6. 3.6 Selecting Interesting Subsets of the Data for Further Investigation
      7. 3.7 Using EDA to Uncover Anomalous Fields
      8. 3.8 Binning Based on Predictive Value
      9. 3.9 Deriving New Variables: Flag Variables
      10. 3.10 Deriving New Variables: Numerical Variables
      11. 3.11 Using EDA to Investigate Correlated Predictor Variables
      12. 3.12 Summary of Our EDA
      13. The R Zone
      14. R References
      15. Exercises
    12. Chapter 4: Dimension-Reduction Methods
      1. 4.1 Need for Dimension-Reduction in Data Mining
      2. 4.2 Principal Components Analysis
      3. 4.3 Applying PCA to the Houses Data Set
      4. 4.4 How Many Components Should We Extract?
      5. 4.5 Profiling the Principal Components
      6. 4.6 Communalities
      7. 4.7 Validation of the Principal Components
      8. 4.8 Factor Analysis
      9. 4.9 Applying Factor Analysis to the Adult Data Set
      10. 4.10 Factor Rotation
      11. 4.11 User-Defined Composites
      12. 4.12 An Example of a User-Defined Composite
      13. The R Zone
      14. R References
      15. Exercises
    13. Part II: Statistical Analysis
    14. Chapter 5: Univariate Statistical Analysis
      1. 5.1 Data Mining Tasks in Discovering Knowledge in Data
      2. 5.2 Statistical Approaches to Estimation and Prediction
      3. 5.3 Statistical Inference
      4. 5.4 How Confident are We in Our Estimates?
      5. 5.5 Confidence Interval Estimation of the Mean
      6. 5.6 How to Reduce the Margin of Error
      7. 5.7 Confidence Interval Estimation of the Proportion
      8. 5.8 Hypothesis Testing for the Mean
      9. 5.9 Assessing The Strength of Evidence Against The Null Hypothesis
      10. 5.10 Using Confidence Intervals to Perform Hypothesis Tests
      11. 5.11 Hypothesis Testing for The Proportion
      12. Reference
      13. The R Zone
      14. R Reference
      15. Exercises
    15. Chapter 6: Multivariate Statistics
      1. 6.1 Two-Sample t-Test for Difference in Means
      2. 6.2 Two-Sample Z-Test for Difference in Proportions
      3. 6.3 Test for the Homogeneity of Proportions
      4. 6.4 Chi-Square Test for Goodness of Fit of Multinomial Data
      5. 6.5 Analysis of Variance
      6. Reference
      7. The R Zone
      8. R Reference
      9. Exercises
    16. Chapter 7: Preparing to Model the Data
      1. 7.1 Supervised Versus Unsupervised Methods
      2. 7.2 Statistical Methodology and Data Mining Methodology
      3. 7.3 Cross-Validation
      4. 7.4 Overfitting
      5. 7.5 Bias–Variance Trade-Off
      6. 7.6 Balancing The Training Data Set
      7. 7.7 Establishing Baseline Performance
      8. The R Zone
      9. R Reference
      10. Exercises
    17. Chapter 8: Simple Linear Regression
      1. 8.1 An Example of Simple Linear Regression
      2. 8.2 Dangers of Extrapolation
      3. 8.3 How Useful is the Regression? The Coefficient of Determination, r²
      4. 8.4 Standard Error of the Estimate, s
      5. 8.5 Correlation Coefficient
      6. 8.6 ANOVA Table for Simple Linear Regression
      7. 8.7 Outliers, High Leverage Points, and Influential Observations
      8. 8.8 Population Regression Equation
      9. 8.9 Verifying The Regression Assumptions
      10. 8.10 Inference in Regression
      11. 8.11 t-Test for the Relationship Between x and y
      12. 8.12 Confidence Interval for the Slope of the Regression Line
      13. 8.13 Confidence Interval for the Correlation Coefficient ρ
      14. 8.14 Confidence Interval for the Mean Value of y Given x
      15. 8.15 Prediction Interval for a Randomly Chosen Value of y Given x
      16. 8.16 Transformations to Achieve Linearity
      17. 8.17 Box–Cox Transformations
      18. The R Zone
      19. R References
      20. Exercises
    18. Chapter 9: Multiple Regression and Model Building
      1. 9.1 An Example of Multiple Regression
      2. 9.2 The Population Multiple Regression Equation
      3. 9.3 Inference in Multiple Regression
      4. 9.4 Regression With Categorical Predictors, Using Indicator Variables
      5. 9.5 Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
      6. 9.6 Sequential Sums of Squares
      7. 9.7 Multicollinearity
      8. 9.8 Variable Selection Methods
      9. 9.9 Gas Mileage Data Set
      10. 9.10 An Application of Variable Selection Methods
      11. 9.11 Using the Principal Components as Predictors in Multiple Regression
      12. The R Zone
      13. R References
      14. Exercises
    19. Part III: Classification
    20. Chapter 10: k-Nearest Neighbor Algorithm
      1. 10.1 Classification Task
      2. 10.2 k-Nearest Neighbor Algorithm
      3. 10.3 Distance Function
      4. 10.4 Combination Function
      5. 10.5 Quantifying Attribute Relevance: Stretching the Axes
      6. 10.6 Database Considerations
      7. 10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
      8. 10.8 Choosing k
      9. 10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
      10. The R Zone
      11. R References
      12. Exercises
    21. Chapter 11: Decision Trees
      1. 11.1 What is a Decision Tree?
      2. 11.2 Requirements for Using Decision Trees
      3. 11.3 Classification and Regression Trees
      4. 11.4 C4.5 Algorithm
      5. 11.5 Decision Rules
      6. 11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
      7. The R Zone
      8. R References
      9. Exercises
    22. Chapter 12: Neural Networks
      1. 12.1 Input and Output Encoding
      2. 12.2 Neural Networks for Estimation and Prediction
      3. 12.3 Simple Example of a Neural Network
      4. 12.4 Sigmoid Activation Function
      5. 12.5 Back-Propagation
      6. 12.6 Gradient-Descent Method
      7. 12.7 Back-Propagation Rules
      8. 12.8 Example of Back-Propagation
      9. 12.9 Termination Criteria
      10. 12.10 Learning Rate
      11. 12.11 Momentum Term
      12. 12.12 Sensitivity Analysis
      13. 12.13 Application of Neural Network Modeling
      14. The R Zone
      15. R References
      16. Exercises
    23. Chapter 13: Logistic Regression
      1. 13.1 Simple Example of Logistic Regression
      2. 13.2 Maximum Likelihood Estimation
      3. 13.3 Interpreting Logistic Regression Output
      4. 13.4 Inference: Are the Predictors Significant?
      5. 13.5 Odds Ratio and Relative Risk
      6. 13.6 Interpreting Logistic Regression for a Dichotomous Predictor
      7. 13.7 Interpreting Logistic Regression for a Polychotomous Predictor
      8. 13.8 Interpreting Logistic Regression for a Continuous Predictor
      9. 13.9 Assumption of Linearity
      10. 13.10 Zero-Cell Problem
      11. 13.11 Multiple Logistic Regression
      12. 13.12 Introducing Higher Order Terms to Handle Nonlinearity
      13. 13.13 Validating the Logistic Regression Model
      14. 13.14 WEKA: Hands-On Analysis Using Logistic Regression
      15. The R Zone
      16. R References
      17. Exercises
    24. Chapter 14: Naïve Bayes and Bayesian Networks
      1. 14.1 Bayesian Approach
      2. 14.2 Maximum A Posteriori (MAP) Classification
      3. 14.3 Posterior Odds Ratio
      4. 14.4 Balancing The Data
      5. 14.5 Naïve Bayes Classification
      6. 14.6 Interpreting The Log Posterior Odds Ratio
      7. 14.7 Zero-Cell Problem
      8. 14.8 Numeric Predictors for Naïve Bayes Classification
      9. 14.9 WEKA: Hands-on Analysis Using Naïve Bayes
      10. 14.10 Bayesian Belief Networks
      11. 14.11 Clothing Purchase Example
      12. 14.12 Using The Bayesian Network to Find Probabilities
      13. The R Zone
      14. R References
      15. Exercises
    25. Chapter 15: Model Evaluation Techniques
      1. 15.1 Model Evaluation Techniques for the Description Task
      2. 15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
      3. 15.3 Model Evaluation Measures for the Classification Task
      4. 15.4 Accuracy and Overall Error Rate
      5. 15.5 Sensitivity and Specificity
      6. 15.6 False-Positive Rate and False-Negative Rate
      7. 15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
      8. 15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
      9. 15.9 Decision Cost/Benefit Analysis
      10. 15.10 Lift Charts and Gains Charts
      11. 15.11 Interweaving Model Evaluation with Model Building
      12. 15.12 Confluence of Results: Applying a Suite of Models
      13. The R Zone
      14. R References
      15. Exercises
      16. Hands-On Analysis
    26. Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs
      1. 16.1 Decision Invariance Under Row Adjustment
      2. 16.2 Positive Classification Criterion
      3. 16.3 Demonstration Of The Positive Classification Criterion
      4. 16.4 Constructing The Cost Matrix
      5. 16.5 Decision Invariance Under Scaling
      6. 16.6 Direct Costs and Opportunity Costs
      7. 16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
      8. 16.8 Rebalancing as a Surrogate for Misclassification Costs
      9. The R Zone
      10. R References
      11. Exercises
    27. Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models
      1. 17.1 Classification Evaluation Measures for a Generic Trinary Target
      2. 17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
      3. 17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
      4. 17.4 Comparing CART Models With and Without Data-Driven Misclassification Costs
      5. 17.5 Classification Evaluation Measures for a Generic k-Nary Target
      6. 17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
      7. The R Zone
      8. R References
      9. Exercises
    28. Chapter 18: Graphical Evaluation of Classification Models
      1. 18.1 Review of Lift Charts and Gains Charts
      2. 18.2 Lift Charts and Gains Charts Using Misclassification Costs
      3. 18.3 Response Charts
      4. 18.4 Profits Charts
      5. 18.5 Return on Investment (ROI) Charts
      6. The R Zone
      7. R References
      8. Exercises
      9. Hands-On Exercises
    29. Part IV: Clustering
    30. Chapter 19: Hierarchical and k-Means Clustering
      1. 19.1 The Clustering Task
      2. 19.2 Hierarchical Clustering Methods
      3. 19.3 Single-Linkage Clustering
      4. 19.4 Complete-Linkage Clustering
      5. 19.5 k-Means Clustering
      6. 19.6 Example of k-Means Clustering at Work
      7. 19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
      8. 19.8 Application of k-Means Clustering Using SAS Enterprise Miner
      9. 19.9 Using Cluster Membership to Predict Churn
      10. The R Zone
      11. R References
      12. Exercises
      13. Hands-On Analysis
    31. Chapter 20: Kohonen Networks
      1. 20.1 Self-Organizing Maps
      2. 20.2 Kohonen Networks
      3. 20.3 Example of a Kohonen Network Study
      4. 20.4 Cluster Validity
      5. 20.5 Application of Clustering Using Kohonen Networks
      6. 20.6 Interpreting The Clusters
      7. 20.7 Using Cluster Membership as Input to Downstream Data Mining Models
      8. The R Zone
      9. R References
      10. Exercises
    32. Chapter 21: BIRCH Clustering
      1. 21.1 Rationale for BIRCH Clustering
      2. 21.2 Cluster Features
      3. 21.3 Cluster Feature Tree
      4. 21.4 Phase 1: Building The CF Tree
      5. 21.5 Phase 2: Clustering The Sub-Clusters
      6. 21.6 Example of BIRCH Clustering, Phase 1: Building The CF Tree
      7. 21.7 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
      8. 21.8 Evaluating The Candidate Cluster Solutions
      9. 21.9 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
      10. The R Zone
      11. R References
      12. Exercises
    33. Chapter 22: Measuring Cluster Goodness
      1. 22.1 Rationale for Measuring Cluster Goodness
      2. 22.2 The Silhouette Method
      3. 22.3 Silhouette Example
      4. 22.4 Silhouette Analysis of the IRIS Data Set
      5. 22.5 The Pseudo-F Statistic
      6. 22.6 Example of the Pseudo-F Statistic
      7. 22.7 Pseudo-F Statistic Applied to the IRIS Data Set
      8. 22.8 Cluster Validation
      9. 22.9 Cluster Validation Applied to the Loans Data Set
      10. The R Zone
      11. R References
      12. Exercises
    34. Part V: Association Rules
    35. Chapter 23: Association Rules
      1. 23.1 Affinity Analysis and Market Basket Analysis
      2. 23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
      3. 23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
      4. 23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
      5. 23.5 Extension From Flag Data to General Categorical Data
      6. 23.6 Information-Theoretic Approach: Generalized Rule Induction Method
      7. 23.7 Association Rules are Easy to do Badly
      8. 23.8 How Can We Measure the Usefulness of Association Rules?
      9. 23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
      10. 23.10 Local Patterns Versus Global Models
      11. The R Zone
      12. R References
      13. Exercises
    36. Part VI: Enhancing Model Performance
    37. Chapter 24: Segmentation Models
      1. 24.1 The Segmentation Modeling Process
      2. 24.2 Segmentation Modeling Using EDA to Identify the Segments
      3. 24.3 Segmentation Modeling using Clustering to Identify the Segments
      4. The R Zone
      5. R References
      6. Exercises
    38. Chapter 25: Ensemble Methods: Bagging and Boosting
      1. 25.1 Rationale for Using an Ensemble of Classification Models
      2. 25.2 Bias, Variance, and Noise
      3. 25.3 When to Apply, and not to apply, Bagging
      4. 25.4 Bagging
      5. 25.5 Boosting
      6. 25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler
      7. References
      8. The R Zone
      9. R Reference
      10. Exercises
    39. Chapter 26: Model Voting and Propensity Averaging
      1. 26.1 Simple Model Voting
      2. 26.2 Alternative Voting Methods
      3. 26.3 Model Voting Process
      4. 26.4 An Application of Model Voting
      5. 26.5 What is Propensity Averaging?
      6. 26.6 Propensity Averaging Process
      7. 26.7 An Application of Propensity Averaging
      8. The R Zone
      9. R References
      10. Exercises
      11. Hands-On Analysis
    40. Part VII: Further Topics
    41. Chapter 27: Genetic Algorithms
      1. 27.1 Introduction To Genetic Algorithms
      2. 27.2 Basic Framework of a Genetic Algorithm
      3. 27.3 Simple Example of a Genetic Algorithm at Work
      4. 27.4 Modifications and Enhancements: Selection
      5. 27.5 Modifications and Enhancements: Crossover
      6. 27.6 Genetic Algorithms for Real-Valued Variables
      7. 27.7 Using Genetic Algorithms to Train a Neural Network
      8. 27.8 WEKA: Hands-On Analysis Using Genetic Algorithms
      9. The R Zone
      10. R References
    42. Chapter 28: Imputation of Missing Data
      1. 28.1 Need for Imputation of Missing Data
      2. 28.2 Imputation of Missing Data: Continuous Variables
      3. 28.3 Standard Error of the Imputation
      4. 28.4 Imputation of Missing Data: Categorical Variables
      5. 28.5 Handling Patterns in Missingness
      6. Reference
      7. The R Zone
      8. R References
    43. Part VIII: Case Study: Predicting Response to Direct-Mail Marketing
    44. Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA
      1. 29.1 Cross-Industry Standard Process for Data Mining
      2. 29.2 Business Understanding Phase
      3. 29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set
      4. 29.4 Data Preparation Phase
      5. 29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis
    45. Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
      1. 30.1 Partitioning the Data
      2. 30.2 Developing the Principal Components
      3. 30.3 Validating the Principal Components
      4. 30.4 Profiling the Principal Components
      5. 30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
      6. 30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
      7. 30.7 Application of k-Means Clustering
      8. 30.8 Validating the Clusters
      9. 30.9 Profiling the Clusters
    46. Chapter 31: Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability
      1. 31.1 Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
      2. 31.2 Modeling And Evaluation Overview
      3. 31.3 Cost-Benefit Analysis Using Data-Driven Costs
      4. 31.4 Variables to be Input To The Models
      5. 31.5 Establishing The Baseline Model Performance
      6. 31.6 Models That Use Misclassification Costs
      7. 31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs
      8. 31.8 Combining Models Using Voting and Propensity Averaging
      9. 31.9 Interpreting The Most Profitable Model
    47. Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
      1. 32.1 Variables to be Input to the Models
      2. 32.2 Models that use Misclassification Costs
      3. 32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs
      4. 32.4 Combining Models using Voting and Propensity Averaging
      5. 32.5 Lessons Learned
      6. 32.6 Conclusions
    48. Appendix A: Data Summarization and Visualization
      1. Part 1: Summarization 1: Building Blocks Of Data Analysis
      2. Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
      3. Part 3: Summarization 2: Measures Of Center, Variability, and Position
      4. Part 4: Summarization And Visualization Of Bivariate Relationships
    49. Index
    50. End User License Agreement

    Product information

    • Title: Data Mining and Predictive Analytics, 2nd Edition
    • Author(s): Chantal D. Larose, Daniel T. Larose
    • Release date: March 2015
    • Publisher(s): Wiley
    • ISBN: 9781118116197