Understanding Batches
Mini-batch GD feels counter-intuitive. Why would smaller batches result in faster training? The answer is that they don’t: if anything, mini-batch GD is generally slower than batch GD at processing the whole training set, because it calculates the gradient once per batch rather than once for all the examples.
Even though mini-batch GD is slower per pass, it tends to converge faster during the first iterations of training. In other words, mini-batch GD takes longer to process the training set, but it moves more quickly toward the target, giving us the fast feedback we need. Let’s see how.
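Before looking at the visualization, here is a minimal sketch of that trade-off on a toy problem. It is an illustration under assumptions: a synthetic linear regression dataset, a mean squared error loss, and hypothetical helper names (`mse_loss`, `mse_gradient`, `train_one_epoch`) that I picked for this example. Both runs see the training set exactly once, but the mini-batch run takes ten steps where the batch run takes one.

```python
import numpy as np


def mse_loss(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))


def mse_gradient(X, y, w):
    """Gradient of the mean squared error with respect to w."""
    return 2 * X.T @ (X @ w - y) / X.shape[0]


def train_one_epoch(X, y, w, lr, batch_size):
    """One pass over the training set, one weight update per batch."""
    steps = 0
    for start in range(0, X.shape[0], batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        w = w - lr * mse_gradient(X_batch, y_batch, w)
        steps += 1
    return w, steps


# Synthetic data (an assumption for this sketch): 200 examples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=200)

w0 = np.zeros(2)

# Batch GD: the whole training set is a single batch, so one update per epoch.
w_batch, batch_steps = train_one_epoch(X, y, w0, lr=0.1, batch_size=200)

# Mini-batch GD: 10 batches of 20 examples, so 10 updates per epoch.
w_mini, mini_steps = train_one_epoch(X, y, w0, lr=0.1, batch_size=20)

batch_loss = mse_loss(X, y, w_batch)
mini_loss = mse_loss(X, y, w_mini)

print(f"batch GD:      {batch_steps} update(s), loss {batch_loss:.4f}")
print(f"mini-batch GD: {mini_steps} update(s), loss {mini_loss:.4f}")
```

On this toy problem, the mini-batch run should end its single pass with a much lower loss, because it corrects the weights ten times instead of once. Each of those ten gradients is noisier than the full-batch gradient, and that noise is the price of the faster early progress.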
Twist That Path
To see why mini-batches converge faster, I visualized gradient descent on a small two-dimensional training set. As usual, you’ll find the ...