Book description
Learn methods of data analysis and their application to real-world data sets
This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified "white box" approach to data mining methods and models. This approach walks readers through the operations and nuances of each method, using small data sets, so readers can gain insight into the inner workings of the method under review. Chapters provide readers with hands-on analysis problems, offering an opportunity to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets.
Data Mining and Predictive Analytics, Second Edition:
Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and the R statistical programming language
Features over 750 chapter exercises, allowing readers to assess their understanding of the new material
Provides a detailed case study that brings together the lessons learned in the book
Includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content
Data Mining and Predictive Analytics, Second Edition will appeal to computer science and statistics students, as well as students in MBA programs, and chief executives.
Table of contents
 Cover
 Series
 Title Page
 Copyright
 Dedication

Preface
 What is Data Mining? What is Predictive Analytics?
 Why is this Book Needed?
 Who Will Benefit from this Book?
 Danger! Data Mining is Easy to do Badly
 “White-Box” Approach
 Algorithm Walk-Throughs
 Exciting New Topics
 The R Zone
 Appendix: Data Summarization and Visualization
 The Case Study: Bringing it all Together
 How the Book is Structured
 The Software
 Weka: The Open-Source Alternative
 The Companion Web Site: www.dataminingconsultant.com
 Data Mining and Predictive Analytics as a Textbook
 Acknowledgments
 Part I: Data Preparation
 Chapter 1: An Introduction to Data Mining and Predictive Analytics

Chapter 2: Data Preprocessing
 2.1 Why do We Need to Preprocess the Data?
 2.2 Data Cleaning
 2.3 Handling Missing Data
 2.4 Identifying Misclassifications
 2.5 Graphical Methods for Identifying Outliers
 2.6 Measures of Center and Spread
 2.7 Data Transformation
 2.8 Min–Max Normalization
 2.9 Z-Score Standardization
 2.10 Decimal Scaling
 2.11 Transformations to Achieve Normality
 2.12 Numerical Methods for Identifying Outliers
 2.13 Flag Variables
 2.14 Transforming Categorical Variables into Numerical Variables
 2.15 Binning Numerical Variables
 2.16 Reclassifying Categorical Variables
 2.17 Adding an Index Field
 2.18 Removing Variables that are not Useful
 2.19 Variables that Should Probably not be Removed
 2.20 Removal of Duplicate Records
 2.21 A Word About ID Fields
 The R Zone
 R Reference
 Exercises

Chapter 3: Exploratory Data Analysis
 3.1 Hypothesis Testing Versus Exploratory Data Analysis
 3.2 Getting to Know the Data Set
 3.3 Exploring Categorical Variables
 3.4 Exploring Numeric Variables
 3.5 Exploring Multivariate Relationships
 3.6 Selecting Interesting Subsets of the Data for Further Investigation
 3.7 Using EDA to Uncover Anomalous Fields
 3.8 Binning Based on Predictive Value
 3.9 Deriving New Variables: Flag Variables
 3.10 Deriving New Variables: Numerical Variables
 3.11 Using EDA to Investigate Correlated Predictor Variables
 3.12 Summary of Our EDA
 The R Zone
 R References
 Exercises

Chapter 4: Dimension-Reduction Methods
 4.1 Need for Dimension-Reduction in Data Mining
 4.2 Principal Components Analysis
 4.3 Applying PCA to the Houses Data Set
 4.4 How Many Components Should We Extract?
 4.5 Profiling the Principal Components
 4.6 Communalities
 4.7 Validation of the Principal Components
 4.8 Factor Analysis
 4.9 Applying Factor Analysis to the Adult Data Set
 4.10 Factor Rotation
 4.11 UserDefined Composites
 4.12 An Example of a UserDefined Composite
 The R Zone
 R References
 Exercises
 Part II: Statistical Analysis

Chapter 5: Univariate Statistical Analysis
 5.1 Data Mining Tasks in Discovering Knowledge in Data
 5.2 Statistical Approaches to Estimation and Prediction
 5.3 Statistical Inference
 5.4 How Confident are We in Our Estimates?
 5.5 Confidence Interval Estimation of the Mean
 5.6 How to Reduce the Margin of Error
 5.7 Confidence Interval Estimation of the Proportion
 5.8 Hypothesis Testing for the Mean
 5.9 Assessing the Strength of Evidence Against the Null Hypothesis
 5.10 Using Confidence Intervals to Perform Hypothesis Tests
 5.11 Hypothesis Testing for the Proportion
 Reference
 The R Zone
 R Reference
 Exercises
 Chapter 6: Multivariate Statistics
 Chapter 7: Preparing to Model the Data

Chapter 8: Simple Linear Regression
 8.1 An Example of Simple Linear Regression
 8.2 Dangers of Extrapolation
 8.3 How Useful is the Regression? The Coefficient of Determination, r²
 8.4 Standard Error of the Estimate, s
 8.5 Correlation Coefficient
 8.6 Anova Table for Simple Linear Regression
 8.7 Outliers, High Leverage Points, and Influential Observations
 8.8 Population Regression Equation
 8.9 Verifying the Regression Assumptions
 8.10 Inference in Regression
 8.11 t-Test for the Relationship Between x and y
 8.12 Confidence Interval for the Slope of the Regression Line
 8.13 Confidence Interval for the Correlation Coefficient ρ
 8.14 Confidence Interval for the Mean Value of y Given x
 8.15 Prediction Interval for a Randomly Chosen Value of y Given x
 8.16 Transformations to Achieve Linearity
 8.17 Box–Cox Transformations
 The R Zone
 R References
 Exercises

Chapter 9: Multiple Regression and Model Building
 9.1 An Example of Multiple Regression
 9.2 The Population Multiple Regression Equation
 9.3 Inference in Multiple Regression
 9.4 Regression With Categorical Predictors, Using Indicator Variables
 9.5 Adjusting R²: Penalizing Models for Including Predictors That Are Not Useful
 9.6 Sequential Sums of Squares
 9.7 Multicollinearity
 9.8 Variable Selection Methods
 9.9 Gas Mileage Data Set
 9.10 An Application of Variable Selection Methods
 9.11 Using the Principal Components as Predictors in Multiple Regression
 The R Zone
 R References
 Exercises
 Part III: Classification

Chapter 10: k-Nearest Neighbor Algorithm
 10.1 Classification Task
 10.2 k-Nearest Neighbor Algorithm
 10.3 Distance Function
 10.4 Combination Function
 10.5 Quantifying Attribute Relevance: Stretching the Axes
 10.6 Database Considerations
 10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
 10.8 Choosing k
 10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
 The R Zone
 R References
 Exercises
 Chapter 11: Decision Trees

Chapter 12: Neural Networks
 12.1 Input and Output Encoding
 12.2 Neural Networks for Estimation and Prediction
 12.3 Simple Example of a Neural Network
 12.4 Sigmoid Activation Function
 12.5 Back-Propagation
 12.6 Gradient-Descent Method
 12.7 Back-Propagation Rules
 12.8 Example of Back-Propagation
 12.9 Termination Criteria
 12.10 Learning Rate
 12.11 Momentum Term
 12.12 Sensitivity Analysis
 12.13 Application of Neural Network Modeling
 The R Zone
 R References
 Exercises

Chapter 13: Logistic Regression
 13.1 Simple Example of Logistic Regression
 13.2 Maximum Likelihood Estimation
 13.3 Interpreting Logistic Regression Output
 13.4 Inference: Are the Predictors Significant?
 13.5 Odds Ratio and Relative Risk
 13.6 Interpreting Logistic Regression for a Dichotomous Predictor
 13.7 Interpreting Logistic Regression for a Polychotomous Predictor
 13.8 Interpreting Logistic Regression for a Continuous Predictor
 13.9 Assumption of Linearity
 13.10 Zero-Cell Problem
 13.11 Multiple Logistic Regression
 13.12 Introducing Higher Order Terms to Handle Nonlinearity
 13.13 Validating the Logistic Regression Model
 13.14 WEKA: Hands-On Analysis Using Logistic Regression
 The R Zone
 R References
 Exercises

Chapter 14: Naïve Bayes and Bayesian Networks
 14.1 Bayesian Approach
 14.2 Maximum A Posteriori (MAP) Classification
 14.3 Posterior Odds Ratio
 14.4 Balancing the Data
 14.5 Naïve Bayes Classification
 14.6 Interpreting the Log Posterior Odds Ratio
 14.7 Zero-Cell Problem
 14.8 Numeric Predictors for Naïve Bayes Classification
 14.9 WEKA: Hands-On Analysis Using Naïve Bayes
 14.10 Bayesian Belief Networks
 14.11 Clothing Purchase Example
 14.12 Using the Bayesian Network to Find Probabilities
 The R Zone
 R References
 Exercises

Chapter 15: Model Evaluation Techniques
 15.1 Model Evaluation Techniques for the Description Task
 15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
 15.3 Model Evaluation Measures for the Classification Task
 15.4 Accuracy and Overall Error Rate
 15.5 Sensitivity and Specificity
 15.6 False-Positive Rate and False-Negative Rate
 15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
 15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
 15.9 Decision Cost/Benefit Analysis
 15.10 Lift Charts and Gains Charts
 15.11 Interweaving Model Evaluation with Model Building
 15.12 Confluence of Results: Applying a Suite of Models
 The R Zone
 R References
 Exercises
 Hands-On Analysis

Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs
 16.1 Decision Invariance Under Row Adjustment
 16.2 Positive Classification Criterion
 16.3 Demonstration of the Positive Classification Criterion
 16.4 Constructing the Cost Matrix
 16.5 Decision Invariance Under Scaling
 16.6 Direct Costs and Opportunity Costs
 16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
 16.8 Rebalancing as a Surrogate for Misclassification Costs
 The R Zone
 R References
 Exercises

Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models
 17.1 Classification Evaluation Measures for a Generic Trinary Target
 17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
 17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
 17.4 Comparing CART Models With and Without Data-Driven Misclassification Costs
 17.5 Classification Evaluation Measures for a Generic k-Nary Target
 17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
 The R Zone
 R References
 Exercises
 Chapter 18: Graphical Evaluation of Classification Models
 Part IV: Clustering

Chapter 19: Hierarchical and k-Means Clustering
 19.1 The Clustering Task
 19.2 Hierarchical Clustering Methods
 19.3 Single-Linkage Clustering
 19.4 Complete-Linkage Clustering
 19.5 k-Means Clustering
 19.6 Example of k-Means Clustering at Work
 19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
 19.8 Application of k-Means Clustering Using SAS Enterprise Miner
 19.9 Using Cluster Membership to Predict Churn
 The R Zone
 R References
 Exercises
 Hands-On Analysis

Chapter 20: Kohonen Networks
 20.1 Self-Organizing Maps
 20.2 Kohonen Networks
 20.3 Example of a Kohonen Network Study
 20.4 Cluster Validity
 20.5 Application of Clustering Using Kohonen Networks
 20.6 Interpreting the Clusters
 20.7 Using Cluster Membership as Input to Downstream Data Mining Models
 The R Zone
 R References
 Exercises

Chapter 21: BIRCH Clustering
 21.1 Rationale for BIRCH Clustering
 21.2 Cluster Features
 21.3 Cluster Feature Tree
 21.4 Phase 1: Building the CF Tree
 21.5 Phase 2: Clustering the Sub-Clusters
 21.6 Example of BIRCH Clustering, Phase 1: Building the CF Tree
 21.7 Example of BIRCH Clustering, Phase 2: Clustering the Sub-Clusters
 21.8 Evaluating the Candidate Cluster Solutions
 21.9 Case Study: Applying BIRCH Clustering to the Bank Loans Data Set
 The R Zone
 R References
 Exercises

Chapter 22: Measuring Cluster Goodness
 22.1 Rationale for Measuring Cluster Goodness
 22.2 The Silhouette Method
 22.3 Silhouette Example
 22.4 Silhouette Analysis of the IRIS Data Set
 22.5 The Pseudo-F Statistic
 22.6 Example of the Pseudo-F Statistic
 22.7 Pseudo-F Statistic Applied to the IRIS Data Set
 22.8 Cluster Validation
 22.9 Cluster Validation Applied to the Loans Data Set
 The R Zone
 R References
 Exercises
 Part V: Association Rules

Chapter 23: Association Rules
 23.1 Affinity Analysis and Market Basket Analysis
 23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
 23.3 How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
 23.4 How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
 23.5 Extension From Flag Data to General Categorical Data
 23.6 Information-Theoretic Approach: Generalized Rule Induction Method
 23.7 Association Rules are Easy to do Badly
 23.8 How Can We Measure the Usefulness of Association Rules?
 23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
 23.10 Local Patterns Versus Global Models
 The R Zone
 R References
 Exercises
 Part VI: Enhancing Model Performance
 Chapter 24: Segmentation Models
 Chapter 25: Ensemble Methods: Bagging and Boosting
 Chapter 26: Model Voting and Propensity Averaging
 Part VII: Further Topics

Chapter 27: Genetic Algorithms
 27.1 Introduction to Genetic Algorithms
 27.2 Basic Framework of a Genetic Algorithm
 27.3 Simple Example of a Genetic Algorithm at Work
 27.4 Modifications and Enhancements: Selection
 27.5 Modifications and Enhancements: Crossover
 27.6 Genetic Algorithms for Real-Valued Variables
 27.7 Using Genetic Algorithms to Train a Neural Network
 27.8 WEKA: Hands-On Analysis Using Genetic Algorithms
 The R Zone
 R References
 Chapter 28: Imputation of Missing Data
 Part VIII: Case Study: Predicting Response to Direct-Mail Marketing
 Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA

Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
 30.1 Partitioning the Data
 30.2 Developing the Principal Components
 30.3 Validating the Principal Components
 30.4 Profiling the Principal Components
 30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
 30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
 30.7 Application of k-Means Clustering
 30.8 Validating the Clusters
 30.9 Profiling the Clusters

Chapter 31: Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability
 31.1 Do You Prefer the Best Model Performance, or a Combination of Performance and Interpretability?
 31.2 Modeling and Evaluation Overview
 31.3 Cost-Benefit Analysis Using Data-Driven Costs
 31.4 Variables to be Input to the Models
 31.5 Establishing the Baseline Model Performance
 31.6 Models That Use Misclassification Costs
 31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs
 31.8 Combining Models Using Voting and Propensity Averaging
 31.9 Interpreting the Most Profitable Model
 Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
 Appendix A: Data Summarization and Visualization
 Index
 End User License Agreement
Product information
 Title: Data Mining and Predictive Analytics, 2nd Edition
 Author(s):
 Release date: March 2015
 Publisher(s): Wiley
 ISBN: 9781118116197