The SGD implementations of gradient descent use a simple (distributed) sampling of the data examples. Recall that the loss part of the optimization problem is the average of the per-example losses, $\frac{1}{n} \sum_{i=1}^n L(w; x_i, y_i)$, and therefore $\frac{1}{n} \sum_{i=1}^n \frac{\partial L(w; x_i, y_i)}{\partial w}$ is the true (sub)gradient. Since computing this would require access to the full dataset, the parameter miniBatchFraction specifies which fraction of the full data to use instead. The average of the gradients over this subset,

\[ \frac{1}{|S|} \sum_{i \in S} \frac{\partial L(w; x_i, y_i)}{\partial w}, \]

is a stochastic gradient. Here $S$ is the sampled subset, of size $|S| = \text{miniBatchFraction} \cdot n$.
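To make the sampling step concrete, below is a minimal, self-contained Scala sketch of mini-batch SGD for least-squares regression. It is not the distributed implementation described above; the data, the helper functions, and the reuse of the parameter names miniBatchFraction, stepSize, and numIterations are illustrative assumptions. Each iteration samples a subset $S$ with expected size $\text{miniBatchFraction} \cdot n$ and steps along the averaged per-example gradient.

```scala
import scala.util.Random

// A minimal sketch of mini-batch SGD for least-squares regression.
// Parameter names mirror the documentation; this is not a library API.
object MiniBatchSGD {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Per-example gradient of the squared loss (1/2)(w·x - y)^2.
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = dot(w, x) - y
    x.map(_ * err)
  }

  def run(data: Seq[(Array[Double], Double)],
          miniBatchFraction: Double,
          stepSize: Double,
          numIterations: Int): Array[Double] = {
    val rng = new Random(42)
    var w = Array.fill(data.head._1.length)(0.0)
    for (_ <- 1 to numIterations) {
      // Sample a subset S with expected size miniBatchFraction * n.
      val batch = data.filter(_ => rng.nextDouble() < miniBatchFraction)
      if (batch.nonEmpty) {
        // Stochastic gradient: average the per-example gradients over S.
        val sum = batch
          .map { case (x, y) => gradient(w, x, y) }
          .reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
        val avg = sum.map(_ / batch.size)
        // Gradient step.
        w = w.zip(avg).map { case (wi, gi) => wi - stepSize * gi }
      }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data: y = 2*x1 + 3*x2, no noise.
    val data = Seq.fill(1000) {
      val x = Array(Random.nextDouble(), Random.nextDouble())
      (x, 2.0 * x(0) + 3.0 * x(1))
    }
    val w = run(data, miniBatchFraction = 0.1, stepSize = 0.5, numIterations = 500)
    println(w.mkString(", ")) // should approach 2.0, 3.0
  }
}
```

Note that with miniBatchFraction = 1.0 every iteration uses the full dataset and the procedure reduces to standard (sub)gradient descent; smaller fractions trade gradient accuracy per step for cheaper iterations.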