Stochastic Gradient Descent

Almost all neural network learning is powered by one very important algorithm: stochastic gradient descent (SGD). SGD is an extension of the standard gradient-descent algorithm. In machine learning, the loss function is often written as a sum of per-example loss functions, such as the squared error E in the cafeteria example. Thus, if we have m training examples, the gradient of the loss also has m additive terms.
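To make this concrete, here is a minimal NumPy sketch of a full-batch gradient for a linear model with squared-error loss. This is an illustrative example, not the book's code: the function name full_batch_gradient and the toy data are assumptions chosen for the sketch. The point to notice is that the gradient is a sum of m per-example terms.

```python
import numpy as np

def full_batch_gradient(theta, X, y):
    """Gradient of the mean squared error over all m examples.

    The loss is (1/m) * sum_i (x_i . theta - y_i)**2, so the gradient
    is a sum of m per-example terms -- its cost grows linearly with m.
    """
    m = X.shape[0]
    per_example_grads = 2 * (X @ theta - y)[:, None] * X  # shape (m, n_features)
    return per_example_grads.sum(axis=0) / m

# Toy data: m = 1_000 examples, 3 features, true weights [1.0, -2.0, 0.5].
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1_000)
print(full_batch_gradient(np.zeros(3), X, y))
```

Every call to full_batch_gradient touches all m rows of X, which is exactly the cost that becomes prohibitive for very large training sets.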

The computational cost of computing this gradient therefore grows linearly with m. For a training set with a billion examples, a single gradient computation takes a very long time, so the gradient-descent algorithm converges extremely slowly, making learning impractical.

SGD depends on a simple insight: the gradient is actually an expectation over the training examples, and an expectation can be approximately estimated from a small set of examples (a minibatch) sampled uniformly from the training set. The gradient computed on the minibatch is an unbiased estimate of the full gradient, and each update step costs only as much as the minibatch size, independent of m.
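The following sketch shows how this estimate turns into an update rule, continuing the toy linear-model setup above. It is a minimal illustration under assumed choices (the function name sgd, the learning rate, the batch size, and the step count are not from the book).

```python
import numpy as np

def sgd(X, y, theta0, lr=0.01, batch_size=32, n_steps=2_000, seed=0):
    """Minibatch SGD for a linear model with squared-error loss."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    m = X.shape[0]
    for _ in range(n_steps):
        # Uniformly sample a small minibatch; its gradient is an
        # unbiased estimate of the full-batch gradient (an expectation).
        idx = rng.integers(0, m, size=batch_size)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / batch_size
        theta -= lr * grad  # ordinary gradient-descent step on the estimate
    return theta

# Toy data: the true weights are [1.0, -2.0, 0.5].
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=10_000)
print(sgd(X, y, theta0=np.zeros(3)))  # approaches the true weights
```

Each step here reads only batch_size rows of X, so the per-step cost stays constant as the training set grows, at the price of a noisier gradient estimate.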
