Chapter 19 Statistics

I should start off with an explanatory note. A lot of data science really should be considered a subset of statistics. It is largely a matter of historical accident that statistics, data science, and machine learning are seen as different things. The disciplines have evolved largely independently, focusing on very different problems, so they have become different enough that I treat them as separate things in this book.

Most data scientists, most of the time, don't really need a thorough knowledge of statistics. There are some who live and breathe it, to be sure, but it's not nearly as useful for data science as one might expect. What's absolutely crucial, however, is the kind of critical thinking that one usually learns in a statistics class. Statistics is all about being extremely, painstakingly careful and rigorous in how we analyze data and in the assumptions we make. Data science focuses more on how to extract features out of data, and there is usually enough data available that we don't need to be so exceedingly careful. But data scientists need to recognize the luxury that having a lot of data provides, and be able to break out more rigorous methods when data is lacking.

This chapter will cover several of the key topics in statistics. For each topic, it will focus on the key ideas, insights, and underlying assumptions, rather than on rigorous derivations of the formulas.

19.1 Statistics in Perspective

It might seem absurd that most data scientists ...
