Let's say we have some task, T. We use a model, , parameterized by some parameter, , and train the model to minimize the loss. We minimize the loss using gradient descent and find the optimal parameter for the model.
Let's recall the update rule of a gradient descent:
So, what are the key elements that make up our gradient descent? Let's ...