Stochastic gradient descent
We can further optimize the training process with a simple change. With basic (or batch) gradient descent, we calculate each adjustment by looking at the entire dataset. The natural next question is: can we calculate the adjustment by looking at less than the entire dataset?
As it turns out, the answer is yes! Because we expect to train the network over many iterations, the gradient will be recomputed many times, so each individual estimate can be calculated from fewer examples. We can even calculate it from a single example. By performing fewer calculations for each network update, we can significantly reduce the amount of computation required, meaning faster ...
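To make the contrast concrete, here is a minimal sketch of the idea above, not the book's own implementation. It fits a one-variable linear model instead of a neural network to keep the example short. The function names (`make_data`, `gradient`, `batch_gd`, `sgd`), the noise-free toy data, the learning rate, and the epoch counts are all assumptions chosen for illustration.

```python
import random


def make_data(n=200, seed=0):
    """Toy dataset (an assumption for this sketch): the noise-free line y = 3x + 0.5."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 for x in xs]
    return xs, ys


def gradient(w, b, xs, ys):
    """Gradient of the mean squared error, averaged over the examples passed in."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def batch_gd(xs, ys, lr=0.1, epochs=200):
    """Batch gradient descent: every update looks at the entire dataset."""
    w = b = 0.0
    for _ in range(epochs):
        gw, gb = gradient(w, b, xs, ys)
        w -= lr * gw
        b -= lr * gb
    return w, b


def sgd(xs, ys, lr=0.1, epochs=20, seed=1):
    """Stochastic gradient descent: every update looks at a single example."""
    rng = random.Random(seed)
    w = b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order each epoch
        for i in order:
            gw, gb = gradient(w, b, [xs[i]], [ys[i]])
            w -= lr * gw
            b -= lr * gb
    return w, b


if __name__ == "__main__":
    xs, ys = make_data()
    print("batch:", batch_gd(xs, ys))
    print("sgd:  ", sgd(xs, ys))
```

With these illustrative defaults, `batch_gd` evaluates 200 × 200 = 40,000 per-example gradients, while `sgd` evaluates 20 × 200 = 4,000, yet it makes 4,000 weight updates instead of 200. The toy data is deliberately noise-free; on noisy data, single-example updates tend to fluctuate around the minimum rather than settle on it exactly.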