When we build a deep neural network, we have many parameters. Parameters are basically the weights of the network, so when we build a network with many layers, we will have many weights, say, θ1, θ2, ..., θn. Our goal is to find the optimal values for all these weights. In all of the previous methods we learned about, the learning rate was a common value for all the parameters of the network. However, Adagrad (short for adaptive gradient) adaptively sets the learning rate for each parameter.
Parameters that have frequent updates or large gradients will have a lower learning rate, while parameters that have infrequent updates or small gradients will have a higher learning rate.
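Adagrad does this by keeping a running sum of the squared gradients of each parameter and scaling the step size by the inverse square root of that sum. Below is a minimal NumPy sketch of one Adagrad update under these assumptions; the names `adagrad_update`, `grad_accum`, `lr`, and `epsilon` are illustrative choices, not taken from the text.

```python
import numpy as np

def adagrad_update(theta, grad, grad_accum, lr=0.01, epsilon=1e-8):
    """Perform one Adagrad step; each parameter gets its own effective step size."""
    # Accumulate the squared gradient for every parameter.
    grad_accum = grad_accum + grad ** 2
    # Parameters with a large accumulated gradient take smaller steps;
    # parameters with a small accumulated gradient take larger steps.
    theta = theta - lr * grad / (np.sqrt(grad_accum) + epsilon)
    return theta, grad_accum

# Toy usage: two parameters, one receiving consistently large gradients.
theta = np.array([1.0, 1.0])
grad_accum = np.zeros_like(theta)
for _ in range(3):
    grad = np.array([5.0, 0.1])  # large, frequent gradient vs. small gradient
    theta, grad_accum = adagrad_update(theta, grad, grad_accum, lr=0.1)
print(theta)  # the first parameter's effective learning rate shrinks faster
```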