This section discusses two topics that form the mathematical and computational underpinnings of much of what we've covered in this book. The goal is to help you frame novel problems in a way that makes theoretical sense and that can realistically be solved with a computer.
23.1 Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a very general way to frame a large class of problems in data science:
- You have a family of probability distributions characterized by some parameters that we'll call θ. For the normal distribution, for example, θ consists of just two numbers: the mean and the standard deviation.
- You assume that a real-world process is described by a probability distribution from this family, but you do not make any assumptions about θ.
- You have a dataset called X that is drawn from the real-world process.
- You find the θ that maximizes the probability P(X|θ).
A large fraction of machine learning classification and regression models fall under this umbrella. They differ widely in the functional form they assume, but they all assume one, at least implicitly. Mathematically, the process of “training the model” really reduces to calculating θ.
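As a concrete illustration of the steps above, here is a minimal sketch for the normal distribution, where θ is the pair (mean, standard deviation). It is not taken from any particular library: the function names, the simulated dataset, and the "true" parameter values are all hypothetical choices made for the example. The sketch assumes the data points are independent and works with the logarithm of the probability, which is maximized by the same θ. For the normal distribution the maximizing θ happens to have a closed form, so no numerical optimizer is needed.

```python
import math
import random


def normal_log_likelihood(data, mean, std):
    """Log of P(X | theta) for independent draws from a normal distribution."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return (-n * math.log(std)
            - 0.5 * n * math.log(2 * math.pi)
            - squared_error / (2 * std ** 2))


def fit_normal_mle(data):
    """Closed-form maximum likelihood estimates of (mean, std)."""
    n = len(data)
    mean = sum(data) / n
    # The MLE divides by n rather than n - 1, so it differs slightly
    # from the usual unbiased sample variance.
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return mean, std


random.seed(0)  # fixed seed so the sketch is reproducible
true_mean, true_std = 5.0, 2.0  # hypothetical "real-world" parameters
data = [random.gauss(true_mean, true_std) for _ in range(10_000)]

mle_mean, mle_std = fit_normal_mle(data)
best = normal_log_likelihood(data, mle_mean, mle_std)
print(f"estimated mean={mle_mean:.3f}, std={mle_std:.3f}")
```

Evaluating `normal_log_likelihood` at any other choice of mean or standard deviation gives a value no larger than `best`, which is exactly what it means for the fitted θ to be the maximum likelihood estimate.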
In MLE problems, we almost always assume that the different data points in X are independent of each other. That is, if there are N data points x_1, ..., x_N, then we assume
$$P(X \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta)$$
In practice, it is often easier to find θ that maximizes the log of the probability, rather ...