January 2018
Beginner to intermediate
284 pages
8h 35m
English
Note that in SGD, we usually perform each weight update on a mini-batch, typically 32 or 64 up to 256 samples. Compare this to batch gradient descent, which computes the gradient over the entire training set. Mini-batch SGD takes less memory and is less prone to getting stuck at a bad spot, such as a saddle point or a poor local minimum, because the noise introduced by the small sample helps the optimizer escape. Compared to pure SGD, which updates the parameters using the gradient computed on a single instance of the dataset, mini-batch SGD is also more stable and more efficient (with relatively fast convergence) than looping over the entire dataset one sample at a time.
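To make the difference between the three variants concrete, here is a minimal sketch in NumPy. It assumes a plain linear model trained with mean squared error; the function name `minibatch_sgd`, the synthetic data, and the hyperparameter values are all hypothetical choices for illustration, not taken from the text. The only thing that changes between variants is `batch_size`.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, epochs=50, seed=0):
    """Fit linear weights w by minimising mean squared error with mini-batch SGD.

    Illustrative sketch only: batch_size=len(X) gives batch gradient descent,
    batch_size=1 gives pure SGD.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch only
            grad = 2.0 * X_b.T @ (X_b @ w - y_b) / len(idx)
            w -= lr * grad  # one weight update per mini-batch
    return w

# Synthetic data (assumed for the demo): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w_mini = minibatch_sgd(X, y, batch_size=32)      # mini-batch SGD
w_full = minibatch_sgd(X, y, batch_size=len(X))  # batch gradient descent
print("mini-batch:", w_mini)
print("full batch:", w_full)
```

With the same number of epochs, the mini-batch run performs many more weight updates than the full-batch run (one per mini-batch rather than one per epoch), which is one reason it tends to converge faster in practice. Passing `batch_size=1` to the same function would give pure SGD.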