Chapter 18. Probability

So far, this book has tacitly assumed that you understand basic probability, such as the notion of independence and what an average is. This chapter will go into more detail, giving you a little bit of theoretical background in the subject and an overview of the standard tools. In practice, data scientists only need a moderate amount of probability theory for most of their daily work, but that moderate amount is crucially important. Probability provides the theoretical basis for almost all of machine learning and most of analytics, and it is a critical mindset for data scientists to be able to adopt.

Probability is often confused with statistics. The way I would break it down is to say that probability is a collection of techniques for describing the world using mathematical models that include randomness. In particular, probability focuses on what you can derive about the world assuming that it is well described by one of these models. For example, if we assume a certain distribution of human heights, then how many people in a crowd can we expect to be over 5 ft tall? Statistics is more about working backward: given some real-world data, what can we infer about the real-world process (which we imagine to be some probability model) that generated it?
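
To make the two directions concrete, here is a minimal sketch in Python using only the standard library. The numbers are assumptions chosen for illustration, not figures from this chapter: heights are modeled as normally distributed with a mean of 66 inches and a standard deviation of 4 inches, and the crowd has 1,000 people. The first half takes the probability view (assume the model, derive a prediction); the second half takes the statistics view (start from data, infer the model).

```python
from statistics import NormalDist

# Assumed model (illustrative numbers only): heights in inches,
# normally distributed with mean 66 and standard deviation 4.
height_model = NormalDist(mu=66.0, sigma=4.0)

crowd_size = 1000      # hypothetical crowd
threshold_in = 60.0    # 5 ft expressed in inches

# Probability: given the model, what do we expect to see?
p_over = 1.0 - height_model.cdf(threshold_in)   # P(height > 5 ft)
expected_over = crowd_size * p_over             # expected count in the crowd

# Statistics: given data, what model plausibly generated it?
# Here the "real-world data" is simulated from the assumed model.
sample = height_model.samples(500, seed=0)
fitted = NormalDist.from_samples(sample)        # estimates mean and std dev

print(f"P(height > 5 ft) = {p_over:.3f}")
print(f"Expected people over 5 ft: {expected_over:.0f} of {crowd_size}")
print(f"Fitted model: mean = {fitted.mean:.2f}, stdev = {fitted.stdev:.2f}")
```

The expected count is just the crowd size times the probability that a single person clears the threshold, which follows from linearity of expectation. The fitted parameters will be close to, but not exactly, the ones used to generate the sample; quantifying that gap is the sort of question statistics is concerned with.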

This chapter will attempt to build up the subject of probability in a very intuitive way. I will start off by showing two of the simplest, most intuitive, and most important probability models. Using these ...
