Chapter 8. Gradient Descent
Those who boast of their descent, brag on what they owe to others.
Frequently when doing data science, we’ll be trying to find the best model for a certain situation. And usually “best” will mean something like “minimizes the error of the model” or “maximizes the likelihood of the data.” In other words, it will represent the solution to some sort of optimization problem.
This means we’ll need to solve a number of optimization problems. And in particular, we’ll need to solve them from scratch. Our approach will be a technique called gradient descent, which lends itself pretty well to a from-scratch treatment. You might not find it super exciting in and of itself, but it will enable us to do exciting things throughout the book, so bear with me.
The Idea Behind Gradient Descent
Suppose we have some function f that takes as input a vector of real numbers and outputs a single real number. One simple such function is:
def sum_of_squares(v):
    """computes the sum of squared elements in v"""
    return sum(v_i ** 2 for v_i in v)
We’ll frequently need to maximize (or minimize) such functions. That is, we need to find the input v that produces the largest (or smallest) possible value.
For functions like ours, the gradient (if you remember your calculus, this is the vector of partial derivatives) gives the input direction in which the function most quickly increases. (If you don’t remember your calculus, take my word for it or look it up on the Internet.) ...
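To make the claim about the gradient concrete, here is a minimal sketch for the sum-of-squares function. It is an illustration under stated assumptions rather than a listing from the text: the helper names `sum_of_squares_gradient` and `estimate_gradient`, the step size `h`, and the sample point `[3.0, -4.0]` are all made up for this example. It compares the analytic gradient with a difference-quotient estimate, then checks that a small step along the gradient raises the function value while a step against it lowers it.

```python
from typing import Callable, List

Vector = List[float]

def sum_of_squares(v: Vector) -> float:
    """Computes the sum of squared elements in v."""
    return sum(v_i ** 2 for v_i in v)

def sum_of_squares_gradient(v: Vector) -> Vector:
    """Analytic gradient: the partial derivative with respect to v_i is 2 * v_i."""
    return [2 * v_i for v_i in v]

def estimate_gradient(f: Callable[[Vector], float],
                      v: Vector,
                      h: float = 1e-5) -> Vector:
    """Approximates each partial derivative with a forward difference quotient.

    The step size h is an assumed, illustrative value.
    """
    base = f(v)
    grad = []
    for i in range(len(v)):
        shifted = list(v)      # copy so the caller's vector is not modified
        shifted[i] += h        # nudge only the i-th coordinate
        grad.append((f(shifted) - base) / h)
    return grad

v = [3.0, -4.0]                # hypothetical sample point
analytic = sum_of_squares_gradient(v)
numeric = estimate_gradient(sum_of_squares, v)

# Take a small step in the gradient direction and one in the opposite direction.
step = 0.01
uphill = [v_i + step * g_i for v_i, g_i in zip(v, analytic)]
downhill = [v_i - step * g_i for v_i, g_i in zip(v, analytic)]

print(analytic, numeric)
print(sum_of_squares(downhill), sum_of_squares(v), sum_of_squares(uphill))
```

The difference quotient only approximates the gradient, so the two vectors agree closely rather than exactly. That trade-off is why an analytic gradient is usually preferable when one is available.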