8 Probability Distributions in
Data analysis, in essence, revolves around trying to understand as much of the variability in our data as possible. The data could be numeric (e.g. height) or categorical (e.g. ethnicity); see Section 1.8.1 for more information. Either way, the nature of the variability in the data from one observation to the next is described via probability distributions.
Before we consider probability distributions, we must think about random variables: informally, a random variable is a quantity that varies randomly from unit to unit. For example, height will naturally vary from person to person, while the outcome of a throw of a die will vary from throw to throw. The probability distribution for height, or for the outcome of a throw of a die, then describes how likely each event, occurrence, or outcome of interest is. From there we can ask questions such as ‘what is the probability (how likely is it) that a person chosen at random from a particular population will be taller than 170 cm?’, or ‘what are the chances that a six is thrown on the next roll of a die?’.
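To make the two questions above concrete, here is a minimal sketch in Python. It assumes a fair six-sided die, and it assumes, purely for illustration, that height in the population follows a normal distribution with a mean of 168 cm and a standard deviation of 10 cm. Those two numbers are hypothetical and do not come from any data set; the variable names are likewise illustrative.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete case: a fair six-sided die (assumption: every face is equally likely).
faces = [1, 2, 3, 4, 5, 6]
die_distribution = {face: Fraction(1, len(faces)) for face in faces}

# Probability that the next throw is a six.
p_six = die_distribution[6]
print("P(six) =", p_six)

# Continuous case: height in cm.
# HYPOTHETICAL parameters, chosen only for illustration.
height = NormalDist(mu=168, sigma=10)

# Probability that a randomly chosen person is taller than 170 cm,
# computed as one minus the cumulative probability up to 170 cm.
p_taller = 1 - height.cdf(170)
print("P(height > 170 cm) =", round(p_taller, 3))
```

The die is described by a table of probabilities that sum to one, while height is described by a continuous curve, so the probability of exceeding 170 cm is read off as an area under that curve rather than looked up for a single value.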
It is useful to have notation so that we can write statements involving random variables succinctly and without ambiguity. It is customary to write random variables using capital letters; this signals that the value the variable takes will vary randomly from unit to unit. Let us say, for example, ...