Chapter 9 Exploratory and Predictive Data Analysis

We are overwhelmed by information, not because there is too much, but because we don’t know how to tame it. Information lies stagnant in rapidly expanding pools as our ability to collect and warehouse it increases, but our ability to make sense of and communicate it remains inert, largely without notice.

Stephen Few, Now You See It

Exploratory data analysis is an approach to analyzing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses. It was so named by John Tukey to contrast with confirmatory data analysis, the term used for the set of ideas about hypothesis testing, p-values, and confidence intervals (CIs).

Tukey argued that statistics placed too much emphasis on statistical hypothesis testing (confirmatory data analysis) and that more emphasis should be placed on letting the data suggest hypotheses worth testing (exploratory data analysis). The two types of analysis must not be muddled: workflows that mix them on the same set of data can introduce systematic bias, because a hypothesis suggested by a dataset will tend to look confirmed when it is tested against that same dataset.
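One common way to keep the two kinds of analysis apart is to explore on one portion of the data and reserve a disjoint portion for confirmation. The sketch below illustrates that idea only; it is not a procedure taken from the text. The function name `split_explore_confirm`, the 50/50 split, and the well records are all hypothetical.

```python
import random


def split_explore_confirm(rows, explore_fraction=0.5, seed=0):
    """Randomly partition rows into disjoint exploratory and confirmatory subsets."""
    if not 0.0 < explore_fraction < 1.0:
        raise ValueError("explore_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(indices)
    cut = int(round(len(rows) * explore_fraction))
    explore = [rows[i] for i in indices[:cut]]
    confirm = [rows[i] for i in indices[cut:]]
    return explore, confirm


# Hypothetical records: (well_id, porosity) pairs invented for illustration.
wells = [(i, 0.10 + 0.01 * (i % 7)) for i in range(20)]

# Look for patterns in `explore` only; test any resulting hypothesis on `confirm`.
explore, confirm = split_explore_confirm(wells, explore_fraction=0.5, seed=42)
```

Hypotheses formed while inspecting `explore` can then be tested on `confirm`, which played no part in suggesting them.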

The exploratory phase “isolates patterns and features of the data and reveals these forcefully to the analyst.”1 If a model is fit to the data, exploratory analysis finds patterns that represent deviations from the model. These patterns lead the analyst to revise the model via ...
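As a minimal sketch of finding deviations from a fitted model, the example below fits a straight line by ordinary least squares and inspects the residuals. The data are synthetic and deliberately quadratic (an assumption made purely for illustration), and the helper `fit_line` is hypothetical rather than drawn from the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data: the true relationship is quadratic, but we fit a line.
xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)

# Residuals are the part of the data that the fitted model fails to explain.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x={x:4.1f}  residual={r:7.2f}")
```

Here the residuals are positive at both ends of the range and negative in the middle. A systematic curve of this kind, rather than random scatter, is the sort of pattern that would prompt an analyst to revise the straight-line model.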
