Optimization
Optimization is a critical part of learning. We use optimization to minimize the objective function (the error function) and thereby learn the correct network weights and structures.
While many advanced optimization algorithms have been developed, the most common approach is still Stochastic Gradient Descent (SGD) and its variations, for example momentum-based methods, AdaGrad, Adam, and RMSProp. We will mainly base our discussion in this section on SGD. Unlike traditional gradient descent, in which each parameter update is computed over all the training samples, SGD computes the gradient and updates the parameters using only a single training example or a few of them. It is often recommended ...
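The contrast between traditional gradient descent and SGD described above can be made concrete with a small sketch. The example below is illustrative only and is not taken from the text: it assumes a one-dimensional linear model with a mean squared error objective, synthetic data generated from y = 3x + 1, and hypothetical helper names (make_data, gradient, batch_gd, sgd). The learning rates, batch size, and epoch counts are arbitrary choices for the demonstration.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic 1-D regression data: y = 3x + 1 plus small noise (assumed for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


def gradient(w, b, xs, ys):
    """Gradient of the mean squared error of y = w*x + b over the given samples."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def batch_gd(xs, ys, lr=0.1, epochs=200):
    """Traditional gradient descent: one update per pass, computed on all samples."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw, gb = gradient(w, b, xs, ys)
        w -= lr * gw
        b -= lr * gb
    return w, b


def sgd(xs, ys, lr=0.05, epochs=50, batch_size=8, seed=1):
    """Minibatch SGD: one update per small batch, so many updates per pass."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the samples in a new random order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            gw, gb = gradient(w, b, bx, by)
            w -= lr * gw
            b -= lr * gb
    return w, b


xs, ys = make_data()
w_gd, b_gd = batch_gd(xs, ys)
w_sgd, b_sgd = sgd(xs, ys)
print(f"batch GD: w={w_gd:.2f}, b={b_gd:.2f}")
print(f"SGD:      w={w_sgd:.2f}, b={b_sgd:.2f}")
```

Both routines share the same gradient function. The only difference is how many samples feed each update: batch_gd uses all of them, while sgd uses batch_size of them. Setting batch_size=1 gives the single-example variant.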