Optimizer
Next, we define the optimizer, which is based on the Adam optimizer.
Adam differs from classical stochastic gradient descent, which maintains a single learning rate (called alpha) for all weight updates and keeps that learning rate fixed during training.
Adam instead maintains a learning rate for each network weight (parameter) and adapts each one separately as learning unfolds. It computes these individual adaptive learning rates from estimates of the first and second moments of the gradients.
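To make the contrast concrete, here is a minimal NumPy sketch of one update step for each method. It is an illustration only, not the code used later in this chapter: the function names sgd_step and adam_step are hypothetical, and the default values for alpha, beta1, beta2, and eps are the commonly cited Adam defaults, assumed here rather than taken from the text.

```python
import numpy as np

def sgd_step(w, grad, alpha=0.01):
    """Plain SGD: one fixed learning rate alpha shared by every weight."""
    return w - alpha * grad

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running estimates of the first and
    second moments of the gradient; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: m and v start at zero, so early estimates are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each weight's step is scaled by its own second-moment estimate.
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Two weights whose gradients differ by four orders of magnitude.
w0 = np.zeros(2)
grad = np.array([100.0, 0.01])

w_sgd = sgd_step(w0, grad)
w_adam, m, v = adam_step(w0, grad, m=np.zeros(2), v=np.zeros(2), t=1)
print("SGD :", w_sgd)
print("Adam:", w_adam)
```

With SGD the two weights move by very different amounts because the step is proportional to the raw gradient. On the first Adam step, both weights move by roughly alpha, since each gradient is normalized by its own second-moment estimate.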
Adam combines the advantages of two other extensions of stochastic gradient descent.
The adaptive gradient algorithm (AdaGrad) maintains a per-parameter learning rate that improves performance ...