Batch by Batch
The style of gradient descent that we've used so far is also called batch gradient descent, because it lumps all the training examples into one big batch and calculates the gradient of the loss over that entire batch. A common alternative is called mini-batch gradient descent. Maybe you already guessed what it does: it splits the training set into smaller batches, and then takes a step of gradient descent for each batch.
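To make the difference concrete, here is a minimal sketch of the two styles. It assumes a linear model trained with the mean squared error loss, so the gradient() function below is just a stand-in, and the names are my own rather than the ones used elsewhere in this book:

import numpy as np

def gradient(X, Y, w):
    # Gradient of the mean squared error of a linear model X @ w.
    # (A stand-in: swap in the gradient of your own model and loss.)
    return 2 * X.T @ (X @ w - Y) / X.shape[0]

def train_batch(X, Y, iterations, lr):
    # Batch GD: every step looks at the entire training set.
    w = np.zeros((X.shape[1], 1))
    for _ in range(iterations):
        w -= gradient(X, Y, w) * lr
    return w

def train_mini_batch(X, Y, epochs, batch_size, lr):
    # Mini-batch GD: every step looks at one slice of the training set,
    # so each pass over the data (one epoch) takes many small steps.
    # (No shuffling yet--that comes in the next section.)
    w = np.zeros((X.shape[1], 1))
    for _ in range(epochs):
        for start in range(0, X.shape[0], batch_size):
            X_batch = X[start:start + batch_size]
            Y_batch = Y[start:start + batch_size]
            w -= gradient(X_batch, Y_batch, w) * lr
    return w

The key change is small: instead of one weight update per pass over the data, mini-batch GD makes one update per batch.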
You might wonder how small batches help speed up training. Stick with me for a moment: let’s implement mini-batch GD and give it a test drive.
Implementing Batches
In most cases, we should shuffle a dataset before we split it into batches. That way, we’re sure that each batch contains a nice mix of examples—as ...
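Here is a sketch of the shuffle-then-split idea, assuming X and Y are NumPy arrays with one example per row; prepare_batches() is my own name for this helper, not necessarily the one used in the book's implementation:

import numpy as np

def prepare_batches(X, Y, batch_size):
    # Shuffle X and Y with the same permutation, so each example
    # stays paired with its label.
    shuffled = np.random.permutation(X.shape[0])
    X, Y = X[shuffled], Y[shuffled]
    # Slice the shuffled data into consecutive batches. The last batch
    # may be smaller if the set doesn't divide evenly.
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

A mini-batch training loop could then reshuffle at every epoch by calling this generator inside the epoch loop, as in: for X_batch, Y_batch in prepare_batches(X, Y, batch_size=32).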