Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition

Name: Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition
Author: Daniel T. Larose
ISBN: 9780470908747

July 2014

Beginner to intermediate

336 pages

9h 30m

English

Wiley

Preface
What is Data Mining?
Why is This Book Needed?
What's New for the Second Edition?
Danger! Data Mining is Easy to Do Badly
“White Box” Approach: Understanding the Underlying Algorithmic and Model Structures
Data Mining as a Process
Graphical Approach, Emphasizing Exploratory Data Analysis
How The Book is Structured
Acknowledgments
Chapter 1: An Introduction to Data Mining
1.1 What is Data Mining?
1.2 Wanted: Data Miners
1.3 The Need for Human Direction of Data Mining
1.4 The Cross-Industry Standard Practice for Data Mining
1.5 Fallacies of Data Mining
1.6 What Tasks Can Data Mining Accomplish?
References
Exercises
Note
Chapter 2: Data Preprocessing
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min-Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are Not Useful
2.19 Variables that Should Probably Not Be Removed
2.20 Removal of Duplicate Records
2.21 A Word About Id Fields
References
Exercises
Hands-On Analysis
Notes
Chapter 3: Exploratory Data Analysis
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know the Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary
Reference
Exercises
Hands-On Analysis
Note
Chapter 4: Univariate Statistical Analysis
4.1 Data Mining Tasks in Discovering Knowledge in Data
4.2 Statistical Approaches to Estimation and Prediction
4.3 Statistical Inference
4.4 How Confident are We in Our Estimates?
4.5 Confidence Interval Estimation of the Mean
4.6 How to Reduce the Margin of Error
4.7 Confidence Interval Estimation of the Proportion
4.8 Hypothesis Testing for the Mean
4.9 Assessing the Strength of Evidence Against the Null Hypothesis
4.10 Using Confidence Intervals to Perform Hypothesis Tests
4.11 Hypothesis Testing for the Proportion
Reference
Exercises
Chapter 5: Multivariate Statistics
5.1 Two-Sample t-Test for Difference in Means
5.2 Two-Sample Z-Test for Difference in Proportions
5.3 Test for Homogeneity of Proportions
5.4 Chi-Square Test for Goodness of Fit of Multinomial Data
5.5 Analysis of Variance
5.6 Regression Analysis
5.7 Hypothesis Testing in Regression
5.8 Measuring the Quality of a Regression Model
5.9 Dangers of Extrapolation
5.10 Confidence Intervals for the Mean Value of y Given x
5.11 Prediction Intervals for a Randomly Chosen Value of y Given x
5.12 Multiple Regression
5.13 Verifying Model Assumptions
Reference
Exercises
Hands-On Analysis
Note
Chapter 6: Preparing to Model the Data
6.1 Supervised Versus Unsupervised Methods
6.2 Statistical Methodology and Data Mining Methodology
6.3 Cross-Validation
6.4 Overfitting
6.5 Bias–Variance Trade-Off
6.6 Balancing the Training Data Set
6.7 Establishing Baseline Performance
Reference
Exercises
Chapter 7: k-Nearest Neighbor Algorithm
7.1 Classification Task
7.2 k-Nearest Neighbor Algorithm
7.3 Distance Function
7.4 Combination Function
7.5 Quantifying Attribute Relevance: Stretching the Axes
7.6 Database Considerations
7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
7.8 Choosing k
7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
Exercises
Hands-On Analysis
Chapter 8: Decision Trees
8.1 What is a Decision Tree?
8.2 Requirements for Using Decision Trees
8.3 Classification and Regression Trees
8.4 C4.5 Algorithm
8.5 Decision Rules
8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises
Hands-On Analysis
Chapter 9: Neural Networks
9.1 Input and Output Encoding
9.2 Neural Networks for Estimation and Prediction
9.3 Simple Example of a Neural Network
9.4 Sigmoid Activation Function
9.5 Back-Propagation
9.6 Termination Criteria
9.7 Learning Rate
9.8 Momentum Term
9.9 Sensitivity Analysis
9.10 Application of Neural Network Modeling
References
Exercises
Hands-On Analysis

Chapter 10: Hierarchical and k-Means Clustering
10.1 The Clustering Task
10.2 Hierarchical Clustering Methods
10.3 Single-Linkage Clustering
10.4 Complete-Linkage Clustering
10.5 k-Means Clustering
10.6 Example of k-Means Clustering at Work
10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
10.8 Application of k-Means Clustering Using SAS Enterprise Miner
10.9 Using Cluster Membership to Predict Churn
References
Exercises
Hands-On Analysis
Note
Chapter 11: Kohonen Networks
11.1 Self-Organizing Maps
11.2 Kohonen Networks
11.3 Example of a Kohonen Network Study
11.4 Cluster Validity
11.5 Application of Clustering Using Kohonen Networks
11.6 Interpreting the Clusters
11.7 Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises
Hands-On Analysis
Chapter 12: Association Rules
12.1 Affinity Analysis and Market Basket Analysis
12.2 Support, Confidence, Frequent Itemsets, and the a Priori Property
12.3 How Does the a Priori Algorithm Work?
12.4 Extension from Flag Data to General Categorical Data
12.5 Information-Theoretic Approach: Generalized Rule Induction Method
12.6 Association Rules are Easy to do Badly
12.7 How can we Measure the Usefulness of Association Rules?
12.8 Do Association Rules Represent Supervised or Unsupervised Learning?
12.9 Local Patterns Versus Global Models
References
Exercises
Hands-On Analysis
Chapter 13: Imputation of Missing Data
13.1 Need for Imputation of Missing Data
13.2 Imputation of Missing Data: Continuous Variables
13.3 Standard Error of the Imputation
13.4 Imputation of Missing Data: Categorical Variables
13.5 Handling Patterns in Missingness
Reference
Exercises
Hands-On Analysis
Notes
Chapter 14: Model Evaluation Techniques
14.1 Model Evaluation Techniques for the Description Task
14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
14.3 Model Evaluation Techniques for the Classification Task
14.4 Error Rate, False Positives, and False Negatives
14.5 Sensitivity and Specificity
14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns
14.7 Decision Cost/Benefit Analysis
14.8 Lift Charts and Gains Charts
14.9 Interweaving Model Evaluation with Model Building
14.10 Confluence of Results: Applying a Suite of Models
Reference
Exercises
Hands-On Analysis
Notes
Appendix: Data Summarization and Visualization
Part 1 Summarization 1: Building Blocks of Data Analysis
Part 2 Visualization: Graphs and Tables for Summarizing and Organizing Data
Part 3 Summarization 2: Measures of Center, Variability, and Position
Part 4 Summarization and Visualization of Bivariate Relationships
Index
End User License Agreement

Content preview from Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition

Preface

What is Data Mining?

According to the Gartner Group,

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

Today, a variety of terms are used to describe this process, including analytics, predictive analytics, big data, machine learning, and knowledge discovery in databases. All of these terms share the objective of mining actionable nuggets of knowledge from large data sets. We shall therefore use the term data mining to represent this process throughout this text.

Why is This Book Needed?

Humans are inundated with data in most fields. Unfortunately, these valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of these data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.

The McKinsey Global Institute reports:¹

There will be a shortage of talent necessary for organizations to take advantage of big data. A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights ...

Publisher Resources

ISBN: 9781118873571
