Chapter 3 Exploratory Data Analysis

3.1 Hypothesis Testing Versus Exploratory Data Analysis

When approaching a data mining problem, a data mining analyst may already have some a priori hypotheses that he or she would like to test regarding the relationships between the variables. For example, suppose that cell-phone executives are interested in whether a recent increase in the fee structure has led to a decrease in market share. In this case, the analyst would test the hypothesis that market share has decreased, and would therefore use hypothesis testing procedures.
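The market-share question above can be framed as a one-sided test of two proportions. The following is a minimal sketch, not the procedure the later chapters prescribe: it assumes market share is estimated from two independent customer surveys (before and after the fee change), and the function name `one_sided_share_test` and all counts are hypothetical, invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_share_test(before_hits, before_n, after_hits, after_n):
    """Pooled two-proportion z-test.

    H0: share_after >= share_before
    Ha: share_after <  share_before  (market share has decreased)

    Returns the z statistic and the one-sided (lower-tail) p-value.
    """
    p_before = before_hits / before_n
    p_after = after_hits / after_n
    # Pooled proportion, used for the standard error under H0 (no change).
    pooled = (before_hits + after_hits) / (before_n + after_n)
    std_err = sqrt(pooled * (1 - pooled) * (1 / before_n + 1 / after_n))
    z = (p_after - p_before) / std_err
    # Lower-tail probability: small values are evidence of a decrease.
    p_value = NormalDist().cdf(z)
    return z, p_value


# Hypothetical survey counts (assumption, not data from the text):
# 300 of 1000 respondents used the carrier before the fee change,
# 260 of 1000 after it.
z, p_value = one_sided_share_test(before_hits=300, before_n=1000,
                                  after_hits=260, after_n=1000)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

A small p-value would lead the analyst to reject the null hypothesis in favor of a decrease in share. The test is one-sided because the executives' question concerns a decrease specifically, not any change.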

A myriad of statistical hypothesis testing procedures are available in the traditional statistical analysis literature. We cover many of these in Chapters 5 and 6. However, analysts do not always have a priori notions of the expected relationships among the variables. Especially when confronted with large, unfamiliar databases, analysts often prefer to use exploratory data analysis (EDA), or graphical data analysis. EDA allows the analyst to

  • delve into the data set;
  • examine the interrelationships among the attributes;
  • identify interesting subsets of the observations;
  • develop an initial idea of possible associations among the predictors, as well as between the predictors and the target variable.
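The activities listed above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the records, the field names (`plan`, `monthly_fee`, `minutes`, `churned`), and the helper `pearson` are all hypothetical and do not come from the text's data set.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical customer records, invented for illustration only.
records = [
    {"plan": "basic",   "monthly_fee": 20.0, "minutes": 110.0, "churned": 0},
    {"plan": "basic",   "monthly_fee": 20.0, "minutes": 95.0,  "churned": 0},
    {"plan": "basic",   "monthly_fee": 25.0, "minutes": 140.0, "churned": 1},
    {"plan": "premium", "monthly_fee": 45.0, "minutes": 310.0, "churned": 0},
    {"plan": "premium", "monthly_fee": 50.0, "minutes": 280.0, "churned": 1},
    {"plan": "premium", "monthly_fee": 55.0, "minutes": 350.0, "churned": 1},
]

predictors = ["monthly_fee", "minutes"]
target = "churned"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def column(name):
    """Extract one attribute as a list of values."""
    return [row[name] for row in records]


# 1. Delve into the data set: basic summaries of each numeric attribute.
for name in predictors:
    values = column(name)
    print(f"{name}: min={min(values)}, mean={mean(values):.1f}, max={max(values)}")

# 2. Examine interrelationships among the attributes.
print("fee vs. minutes:", round(pearson(column("monthly_fee"), column("minutes")), 3))

# 3. Identify interesting subsets: target rate within each plan.
by_plan = defaultdict(list)
for row in records:
    by_plan[row["plan"]].append(row[target])
churn_rate_by_plan = {plan: mean(flags) for plan, flags in by_plan.items()}
print("churn rate by plan:", churn_rate_by_plan)

# 4. Associations between each predictor and the target variable.
for name in predictors:
    print(f"{name} vs. {target}:", round(pearson(column(name), column(target)), 3))
```

With real data an analyst would more likely reach for a data-frame library and plots. The standard-library version is used here only to keep the sketch self-contained.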

3.2 Getting to Know the Data Set

Graphs, plots, and tables often uncover relationships that point to areas worth further investigation. In Chapter 3, we use exploratory methods to delve into ...
