8 Probability Distributions in
Data analysis is, in essence, an attempt to understand as much of the variability in our data as possible. The data may be numeric (e.g. height) or categorical (e.g. ethnicity); see Section 1.8.1 for more information. Either way, we describe how the data vary from one observation to the next using probability distributions.
Before we consider probability distributions, we must think about random variables: informally, a random variable is a quantity that varies randomly from unit to unit. For example, height will naturally vary from person to person, while the outcome of a throw of a die will vary from throw to throw. The probability distribution for height, or for the outcome of a throw of a die, then describes how likely each event, occurrence, or outcome of interest is. From there we can ask questions such as ‘what's the probability (how likely is it) that a person chosen at random from a particular population will be taller than 170 cm?’, or ‘what are the chances that a six is thrown on the next roll of a die?’.
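The two questions above can be sketched in R. This is only an illustration under stated assumptions: the die is assumed to be fair, and height is assumed to follow a normal distribution with a mean of 168 cm and a standard deviation of 9 cm. Those two values are invented for the example and do not come from any real population.

```r
# --- Die: assume a fair six-sided die, so each face has probability 1/6 ---
faces <- 1:6
p_die <- rep(1 / 6, length(faces))   # probability distribution of the outcome
names(p_die) <- faces
p_six <- p_die["6"]                  # exact probability of throwing a six

# Check the exact value against simulated throws
set.seed(1)                          # make the simulation reproducible
rolls <- sample(faces, size = 10000, replace = TRUE)
prop_six <- mean(rolls == 6)         # observed proportion of sixes

# --- Height: ASSUMED normal distribution with illustrative parameters ---
mu <- 168                            # hypothetical mean height in cm
sigma <- 9                           # hypothetical standard deviation in cm

# Probability that a randomly chosen person is taller than 170 cm,
# i.e. the upper-tail area of the assumed distribution
p_taller <- pnorm(170, mean = mu, sd = sigma, lower.tail = FALSE)

print(p_six)
print(prop_six)
print(p_taller)
```

The simulated proportion of sixes should sit close to 1/6, and it gets closer as the number of throws grows. The height probability depends entirely on the assumed mean and standard deviation, so different values would give a different answer.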
It is useful to have notation so that we can write statements involving random variables succinctly and without ambiguity. It is customary to write random variables using capital letters: a capital letter tells us that the value the variable takes will vary randomly from unit to unit. Let us say, for example, ...