Chapter 3. Statistics

Applying the basic principles of statistics to data science provides vital insight into our data. Statistics is a powerful tool. Used correctly, it enables us to be confident in our decision-making process. However, it is easy to use statistics incorrectly. One example is Anscombe’s quartet (Figure 3-1), which demonstrates how four distinct datasets can have nearly identical summary statistics. In many cases, a simple plot of the data alerts us right away to what is really going on. In the case of Anscombe’s quartet, we can instantly pick out these features: in the upper-left panel, x and y appear to be linear, but noisy. In the upper-right panel, x and y form a peaked, nonlinear relationship. In the lower-left panel, x and y are precisely linear, except for one outlier. The lower-right panel shows that y is statistically distributed for x = 8 and that there ...
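To make the point about nearly identical statistics concrete, here is a minimal sketch in plain Java that computes the mean, sample variance, and Pearson correlation for each of the four datasets. The data values are the standard published values of Anscombe’s quartet and are not reproduced in the text above. The class name `AnscombeStats` and its method names are illustrative assumptions, not an API from any library, and the ordering of the datasets (I through IV for the upper-left, upper-right, lower-left, and lower-right panels) follows the usual layout of the figure.

```java
import java.util.Locale;

// Illustrative helper class (hypothetical name) for summarizing Anscombe's quartet.
final class AnscombeStats {

    // Datasets I, II, and III share the same x values.
    static final double[] X123 = {10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5};
    // Dataset IV has x = 8 everywhere except for a single point at x = 19.
    static final double[] X4 = {8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8};

    // y values for datasets I through IV, in that order.
    static final double[][] Y = {
        {8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68},
        {9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74},
        {7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73},
        {6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89}
    };

    // Arithmetic mean.
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] v) {
        double m = mean(v);
        double ss = 0.0;
        for (double d : v) {
            ss += (d - m) * (d - m);
        }
        return ss / (v.length - 1);
    }

    // Pearson correlation coefficient between x and y (arrays of equal length).
    static double correlation(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        for (int i = 0; i < Y.length; i++) {
            double[] x = (i < 3) ? X123 : X4;
            double[] y = Y[i];
            System.out.printf(Locale.ROOT,
                "dataset %d: mean(x)=%.2f var(x)=%.2f mean(y)=%.2f var(y)=%.2f corr=%.3f%n",
                i + 1, mean(x), variance(x), mean(y), variance(y), correlation(x, y));
        }
    }
}
```

Running `main` prints one line per dataset. The four lines agree closely even though the plotted shapes differ, which is why a plot is worth making before trusting the summary numbers alone.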
