Book description
Learn methods of data analysis and their application to real-world data sets
This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified "white box" approach that walks readers through the operations and nuances of each method using small data sets, so they can gain insight into the inner workings of the method under review. Chapters provide hands-on analysis problems, giving readers the opportunity to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets.
Data Mining and Predictive Analytics, Second Edition:
Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and R statistical programming language
Features over 750 chapter exercises, allowing readers to assess their understanding of the new material
Provides a detailed case study that brings together the lessons learned in the book
Includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content
Data Mining and Predictive Analytics, Second Edition will appeal to computer science and statistics students, as well as students in MBA programs and chief executives.
Table of contents
- Cover
- Series
- Title Page
- Copyright
- Dedication
- Preface
- What is Data Mining? What is Predictive Analytics?
- Why is this Book Needed?
- Who Will Benefit from this Book?
- Danger! Data Mining is Easy to do Badly
- “White-Box” Approach
- Algorithm Walk-Throughs
- Exciting New Topics
- The R Zone
- Appendix: Data Summarization and Visualization
- The Case Study: Bringing it all Together
- How the Book is Structured
- The Software
- Weka: The Open-Source Alternative
- The Companion Web Site: www.dataminingconsultant.com
- Data Mining and Predictive Analytics as a Textbook
- Acknowledgments
- Part I: Data Preparation
- Chapter 1: An Introduction to Data Mining and Predictive Analytics
- Chapter 2: Data Preprocessing
- 2.1 Why do We Need to Preprocess the Data?
- 2.2 Data Cleaning
- 2.3 Handling Missing Data
- 2.4 Identifying Misclassifications
- 2.5 Graphical Methods for Identifying Outliers
- 2.6 Measures of Center and Spread
- 2.7 Data Transformation
- 2.8 Min–Max Normalization
- 2.9 Z-Score Standardization
- 2.10 Decimal Scaling
- 2.11 Transformations to Achieve Normality
- 2.12 Numerical Methods for Identifying Outliers
- 2.13 Flag Variables
- 2.14 Transforming Categorical Variables into Numerical Variables
- 2.15 Binning Numerical Variables
- 2.16 Reclassifying Categorical Variables
- 2.17 Adding an Index Field
- 2.18 Removing Variables that are not Useful
- 2.19 Variables that Should Probably not be Removed
- 2.20 Removal of Duplicate Records
- 2.21 A Word About ID Fields
- The R Zone
- R Reference
- Exercises
- Chapter 3: Exploratory Data Analysis
- 3.1 Hypothesis Testing Versus Exploratory Data Analysis
- 3.2 Getting to Know The Data Set
- 3.3 Exploring Categorical Variables
- 3.4 Exploring Numeric Variables
- 3.5 Exploring Multivariate Relationships
- 3.6 Selecting Interesting Subsets of the Data for Further Investigation
- 3.7 Using EDA to Uncover Anomalous Fields
- 3.8 Binning Based on Predictive Value
- 3.9 Deriving New Variables: Flag Variables
- 3.10 Deriving New Variables: Numerical Variables
- 3.11 Using EDA to Investigate Correlated Predictor Variables
- 3.12 Summary of Our EDA
- The R Zone
- R References
- Exercises
- Chapter 4: Dimension-Reduction Methods
- 4.1 Need for Dimension-Reduction in Data Mining
- 4.2 Principal Components Analysis
- 4.3 Applying PCA to the Houses Data Set
- 4.4 How Many Components Should We Extract?
- 4.5 Profiling the Principal Components
- 4.6 Communalities
- 4.7 Validation of the Principal Components
- 4.8 Factor Analysis
- 4.9 Applying Factor Analysis to the Adult Data Set
- 4.10 Factor Rotation
- 4.11 User-Defined Composites
- 4.12 An Example of a User-Defined Composite
- The R Zone
- R References
- Exercises
- Part II: Statistical Analysis
- Chapter 5: Univariate Statistical Analysis
- 5.1 Data Mining Tasks in Discovering Knowledge in Data
- 5.2 Statistical Approaches to Estimation and Prediction
- 5.3 Statistical Inference
- 5.4 How Confident are We in Our Estimates?
- 5.5 Confidence Interval Estimation of the Mean
- 5.6 How to Reduce the Margin of Error
- 5.7 Confidence Interval Estimation of the Proportion
- 5.8 Hypothesis Testing for the Mean
- 5.9 Assessing The Strength of Evidence Against The Null Hypothesis
- 5.10 Using Confidence Intervals to Perform Hypothesis Tests
- 5.11 Hypothesis Testing for The Proportion
- Reference
- The R Zone
- R Reference
- Exercises
- Chapter 6: Multivariate Statistics
- Chapter 7: Preparing to Model the Data
- Chapter 8: Simple Linear Regression
- 8.1 An Example of Simple Linear Regression
- 8.2 Dangers of Extrapolation
- 8.3 How Useful is the Regression? The Coefficient of Determination, r2
- 8.4 Standard Error of the Estimate, s
- 8.5 Correlation Coefficient
- 8.6 ANOVA Table for Simple Linear Regression
- 8.7 Outliers, High Leverage Points, and Influential Observations
- 8.8 Population Regression Equation
- 8.9 Verifying The Regression Assumptions
- 8.10 Inference in Regression
- 8.11 t-Test for the Relationship Between x and y
- 8.12 Confidence Interval for the Slope of the Regression Line
- 8.13 Confidence Interval for the Correlation Coefficient ρ
- 8.14 Confidence Interval for the Mean Value of y, Given x
- 8.15 Prediction Interval for a Randomly Chosen Value of y, Given x
- 8.16 Transformations to Achieve Linearity
- 8.17 Box–Cox Transformations
- The R Zone
- R References
- Exercises
- Chapter 9: Multiple Regression and Model Building
- 9.1 An Example of Multiple Regression
- 9.2 The Population Multiple Regression Equation
- 9.3 Inference in Multiple Regression
- 9.4 Regression With Categorical Predictors, Using Indicator Variables
- 9.5 Adjusting R2: Penalizing Models For Including Predictors That Are Not Useful
- 9.6 Sequential Sums of Squares
- 9.7 Multicollinearity
- 9.8 Variable Selection Methods
- 9.9 Gas Mileage Data Set
- 9.10 An Application of Variable Selection Methods
- 9.11 Using the Principal Components as Predictors in Multiple Regression
- The R Zone
- R References
- Exercises
- Part III: Classification
- Chapter 10: k-Nearest Neighbor Algorithm
- 10.1 Classification Task
- 10.2 k-Nearest Neighbor Algorithm
- 10.3 Distance Function
- 10.4 Combination Function
- 10.5 Quantifying Attribute Relevance: Stretching the Axes
- 10.6 Database Considerations
- 10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
- 10.8 Choosing k
- 10.9 Application of k-Nearest Neighbor Algorithm Using IBM SPSS Modeler
- The R Zone
- R References
- Exercises
- Chapter 11: Decision Trees
- Chapter 12: Neural Networks
- 12.1 Input and Output Encoding
- 12.2 Neural Networks for Estimation and Prediction
- 12.3 Simple Example of a Neural Network
- 12.4 Sigmoid Activation Function
- 12.5 Back-Propagation
- 12.6 Gradient-Descent Method
- 12.7 Back-Propagation Rules
- 12.8 Example of Back-Propagation
- 12.9 Termination Criteria
- 12.10 Learning Rate
- 12.11 Momentum Term
- 12.12 Sensitivity Analysis
- 12.13 Application of Neural Network Modeling
- The R Zone
- R References
- Exercises
- Chapter 13: Logistic Regression
- 13.1 Simple Example of Logistic Regression
- 13.2 Maximum Likelihood Estimation
- 13.3 Interpreting Logistic Regression Output
- 13.4 Inference: Are the Predictors Significant?
- 13.5 Odds Ratio and Relative Risk
- 13.6 Interpreting Logistic Regression for a Dichotomous Predictor
- 13.7 Interpreting Logistic Regression for a Polychotomous Predictor
- 13.8 Interpreting Logistic Regression for a Continuous Predictor
- 13.9 Assumption of Linearity
- 13.10 Zero-Cell Problem
- 13.11 Multiple Logistic Regression
- 13.12 Introducing Higher Order Terms to Handle Nonlinearity
- 13.13 Validating the Logistic Regression Model
- 13.14 WEKA: Hands-On Analysis Using Logistic Regression
- The R Zone
- R References
- Exercises
- Chapter 14: Naïve Bayes and Bayesian Networks
- 14.1 Bayesian Approach
- 14.2 Maximum A Posteriori (MAP) Classification
- 14.3 Posterior Odds Ratio
- 14.4 Balancing The Data
- 14.5 Naïve Bayes Classification
- 14.6 Interpreting The Log Posterior Odds Ratio
- 14.7 Zero-Cell Problem
- 14.8 Numeric Predictors for Naïve Bayes Classification
- 14.9 WEKA: Hands-on Analysis Using Naïve Bayes
- 14.10 Bayesian Belief Networks
- 14.11 Clothing Purchase Example
- 14.12 Using The Bayesian Network to Find Probabilities
- The R Zone
- R References
- Exercises
- Chapter 15: Model Evaluation Techniques
- 15.1 Model Evaluation Techniques for the Description Task
- 15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
- 15.3 Model Evaluation Measures for the Classification Task
- 15.4 Accuracy and Overall Error Rate
- 15.5 Sensitivity and Specificity
- 15.6 False-Positive Rate and False-Negative Rate
- 15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
- 15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
- 15.9 Decision Cost/Benefit Analysis
- 15.10 Lift Charts and Gains Charts
- 15.11 Interweaving Model Evaluation with Model Building
- 15.12 Confluence of Results: Applying a Suite of Models
- The R Zone
- R References
- Exercises
- Hands-On Analysis
- Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs
- 16.1 Decision Invariance Under Row Adjustment
- 16.2 Positive Classification Criterion
- 16.3 Demonstration Of The Positive Classification Criterion
- 16.4 Constructing The Cost Matrix
- 16.5 Decision Invariance Under Scaling
- 16.6 Direct Costs and Opportunity Costs
- 16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
- 16.8 Rebalancing as a Surrogate for Misclassification Costs
- The R Zone
- R References
- Exercises
- Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models
- 17.1 Classification Evaluation Measures for a Generic Trinary Target
- 17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
- 17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
- 17.4 Comparing CART Models With and Without Data-Driven Misclassification Costs
- 17.5 Classification Evaluation Measures for a Generic k-Nary Target
- 17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
- The R Zone
- R References
- Exercises
- Chapter 18: Graphical Evaluation of Classification Models
- Part IV: Clustering
- Chapter 19: Hierarchical and k-Means Clustering
- 19.1 The Clustering Task
- 19.2 Hierarchical Clustering Methods
- 19.3 Single-Linkage Clustering
- 19.4 Complete-Linkage Clustering
- 19.5 k-Means Clustering
- 19.6 Example of k-Means Clustering at Work
- 19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
- 19.8 Application of k-Means Clustering Using SAS Enterprise Miner
- 19.9 Using Cluster Membership to Predict Churn
- The R Zone
- R References
- Exercises
- Hands-On Analysis
- Chapter 20: Kohonen Networks
- 20.1 Self-Organizing Maps
- 20.2 Kohonen Networks
- 20.3 Example of a Kohonen Network Study
- 20.4 Cluster Validity
- 20.5 Application of Clustering Using Kohonen Networks
- 20.6 Interpreting The Clusters
- 20.7 Using Cluster Membership as Input to Downstream Data Mining Models
- The R Zone
- R References
- Exercises
- Chapter 21: BIRCH Clustering
- 21.1 Rationale for BIRCH Clustering
- 21.2 Cluster Features
- 21.3 Cluster Feature Tree
- 21.4 Phase 1: Building The CF Tree
- 21.5 Phase 2: Clustering The Sub-Clusters
- 21.6 Example of BIRCH Clustering, Phase 1: Building The CF Tree
- 21.7 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
- 21.8 Evaluating The Candidate Cluster Solutions
- 21.9 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
- The R Zone
- R References
- Exercises
- Chapter 22: Measuring Cluster Goodness
- 22.1 Rationale for Measuring Cluster Goodness
- 22.2 The Silhouette Method
- 22.3 Silhouette Example
- 22.4 Silhouette Analysis of the IRIS Data Set
- 22.5 The Pseudo-F Statistic
- 22.6 Example of the Pseudo-F Statistic
- 22.7 Pseudo-F Statistic Applied to the IRIS Data Set
- 22.8 Cluster Validation
- 22.9 Cluster Validation Applied to the Loans Data Set
- The R Zone
- R References
- Exercises
- Part V: Association Rules
- Chapter 23: Association Rules
- 23.1 Affinity Analysis and Market Basket Analysis
- 23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
- 23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
- 23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
- 23.5 Extension From Flag Data to General Categorical Data
- 23.6 Information-Theoretic Approach: Generalized Rule Induction Method
- 23.7 Association Rules are Easy to do Badly
- 23.8 How Can We Measure the Usefulness of Association Rules?
- 23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
- 23.10 Local Patterns Versus Global Models
- The R Zone
- R References
- Exercises
- Part VI: Enhancing Model Performance
- Chapter 24: Segmentation Models
- Chapter 25: Ensemble Methods: Bagging and Boosting
- Chapter 26: Model Voting and Propensity Averaging
- Part VII: Further Topics
- Chapter 27: Genetic Algorithms
- 27.1 Introduction To Genetic Algorithms
- 27.2 Basic Framework of a Genetic Algorithm
- 27.3 Simple Example of a Genetic Algorithm at Work
- 27.4 Modifications and Enhancements: Selection
- 27.5 Modifications and Enhancements: Crossover
- 27.6 Genetic Algorithms for Real-Valued Variables
- 27.7 Using Genetic Algorithms to Train a Neural Network
- 27.8 WEKA: Hands-On Analysis Using Genetic Algorithms
- The R Zone
- R References
- Chapter 28: Imputation of Missing Data
- Part VIII: Case Study: Predicting Response to Direct-Mail Marketing
- Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA
- Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
- 30.1 Partitioning the Data
- 30.2 Developing the Principal Components
- 30.3 Validating the Principal Components
- 30.4 Profiling the Principal Components
- 30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
- 30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
- 30.7 Application of k-Means Clustering
- 30.8 Validating the Clusters
- 30.9 Profiling the Clusters
- Chapter 31: Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability
- 31.1 Do You Prefer the Best Model Performance, or a Combination of Performance and Interpretability?
- 31.2 Modeling and Evaluation Overview
- 31.3 Cost-Benefit Analysis Using Data-Driven Costs
- 31.4 Variables to be Input to the Models
- 31.5 Establishing the Baseline Model Performance
- 31.6 Models That Use Misclassification Costs
- 31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs
- 31.8 Combining Models Using Voting and Propensity Averaging
- 31.9 Interpreting the Most Profitable Model
- Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
- Appendix A: Data Summarization and Visualization
- Index
- End User License Agreement
Product information
- Title: Data Mining and Predictive Analytics, 2nd Edition
- Author(s):
- Release date: March 2015
- Publisher(s): Wiley
- ISBN: 9781118116197