AdaGrad accumulates all historical, parameter-specific gradient information and rescales each parameter's learning rate inversely proportional to the square root of its cumulative squared gradient. The goal is to slow down updates for parameters that have already changed a lot and to encourage larger adjustments for those that haven't.
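As a rough illustration of this update rule (a minimal NumPy sketch, not code from the text; the function name adagrad_update and the hyperparameters lr and eps are illustrative assumptions):

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.01, eps=1e-8):
    # Accumulate the squared gradient for each parameter.
    accum += grad ** 2
    # Scale the step inversely by the square root of the accumulated sum;
    # eps guards against division by zero before any gradient is seen.
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

# Toy usage: minimize f(w) = w^2 (gradient 2w); w shrinks toward 0,
# and the effective step size decays as squared gradients accumulate.
w, acc = np.array([5.0]), np.zeros(1)
for _ in range(100):
    w, acc = adagrad_update(w, 2 * w, acc, lr=0.5)
```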
AdaGrad is designed to perform well on convex functions and has had mixed performance in deep learning contexts because it can reduce the learning rate too quickly based on early gradient information.