March 2020
342 pages
Mini-batch GD feels counter-intuitive. Why do smaller batches result in faster training? The answer is that they don’t: if anything, mini-batch GD is generally slower than batch GD at processing the whole training set because it calculates the gradient for each batch, rather than once for all the examples.
Even though mini-batch GD is slower per pass over the training set, it tends to converge faster during the first iterations of training. In other words, mini-batch GD takes longer to process the whole training set, but it moves more quickly toward the target, giving us the fast feedback we need. Let’s see how.
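Before the visualization, here is a minimal sketch of the same idea in plain Python. It is not the book's code: the toy linear-regression data set, the function names (`make_data`, `train`, and so on), the learning rate, and the batch size of 20 are all assumptions chosen for illustration. Both runs see every example once per epoch, but batch GD updates the weights once per epoch, while mini-batch GD updates them once per batch.

```python
import random


def make_data(n, seed=0):
    # Hypothetical toy data set: y = 3x + 2 plus a little Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def loss(xs, ys, w, b):
    # Mean squared error of the line y_hat = w * x + b.
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(xs, ys, w, b):
    # Gradient of the mean squared error with respect to w and b.
    n = len(xs)
    dw = 2.0 * sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = 2.0 * sum((w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def train(xs, ys, batch_size, epochs, lr=0.1):
    # batch_size == len(xs) gives batch GD; anything smaller gives mini-batch GD.
    w, b = 0.0, 0.0
    history = []  # loss on the whole training set after each epoch
    updates = 0   # number of gradient calculations (one per batch)
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            dw, db = gradient(batch_x, batch_y, w, b)
            w -= lr * dw
            b -= lr * db
            updates += 1
        history.append(loss(xs, ys, w, b))
    return history, updates


xs, ys = make_data(200)
batch_history, batch_updates = train(xs, ys, batch_size=len(xs), epochs=5)
mini_history, mini_updates = train(xs, ys, batch_size=20, epochs=5)

print(f"gradient calculations: batch {batch_updates}, mini-batch {mini_updates}")
for epoch, (bl, ml) in enumerate(zip(batch_history, mini_history), start=1):
    print(f"epoch {epoch}: batch loss {bl:.4f}, mini-batch loss {ml:.4f}")
```

On this toy problem, mini-batch GD calculates ten times as many gradients per epoch, which is where the extra processing cost comes from, yet its loss after each of the first few epochs should be well below the batch GD loss. The sketch counts epochs rather than wall-clock time, so it only shows the "moves more quickly toward the target" half of the trade-off.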
To see why mini-batches converge faster, I visualized gradient descent on a small two-dimensional training set. As usual, you’ll find the ...